Preface


Introduction 


With the ever increasing amounts of data in electronic form, the need for automated methods 
for data analysis continues to grow. The goal of machine learning is to develop methods that 
can automatically detect patterns in data, and then to use the uncovered patterns to predict 
future data or other outcomes of interest. Machine learning is thus closely related to the fields 
of statistics and data mining, but differs slightly in terms of its emphasis and terminology. This 
book provides a detailed introduction to the field, and includes worked examples drawn from 
application domains such as molecular biology, text processing, computer vision, and robotics. 


Target audience 


This book is suitable for upper-level undergraduate students and beginning graduate students in 
computer science, statistics, electrical engineering, econometrics, or anyone else who has the
appropriate mathematical background. Specifically, the reader is assumed to already be familiar 
with basic multivariate calculus, probability, linear algebra, and computer programming. Prior 
exposure to statistics is helpful but not necessary. 


A probabilistic approach 


This book adopts the view that the best way to make machines that can learn from data is to
use the tools of probability theory, which has been the mainstay of statistics and engineering for 
centuries. Probability theory can be applied to any problem involving uncertainty. In machine 
learning, uncertainty comes in many forms: what is the best prediction (or decision) given some 
data? what is the best model given some data? what measurement should I perform next? etc. 

The systematic application of probabilistic reasoning to all inferential problems, including 
inferring parameters of statistical models, is sometimes called a Bayesian approach. However, 
this term tends to elicit very strong reactions (either positive or negative, depending on who 
you ask), so we prefer the more neutral term “probabilistic approach”. Besides, we will often 
use techniques such as maximum likelihood estimation, which are not Bayesian methods, but 
certainly fall within the probabilistic paradigm. 

Rather than describing a cookbook of different heuristic methods, this book stresses a principled
model-based approach to machine learning. For any given model, a variety of algorithms
can often be applied. Conversely, any given algorithm can often be applied to a variety of
models. This kind of modularity, where we distinguish model from algorithm, is good pedagogy 
and good engineering. 

We will often use the language of graphical models to specify our models in a concise and 
intuitive way. In addition to aiding comprehension, the graph structure aids in developing 
efficient algorithms, as we will see. However, this book is not primarily about graphical models; 
it is about probabilistic modeling in general. 


A practical approach 


Nearly all of the methods described in this book have been implemented in a MATLAB software 
package called PMTK, which stands for probabilistic modeling toolkit. This is freely available 
from pmtk3.googlecode.com (the digit 3 refers to the third edition of the toolkit, which is the 
one used in this version of the book). There are also a variety of supporting files, written by other 
people, available at pmtksupport.googlecode.com. These will be downloaded automatically, 
if you follow the setup instructions described on the PMTK website. 

MATLAB is a high-level, interactive scripting language ideally suited to numerical computation 
and data visualization, and can be purchased from www.mathworks.com. Some of the code 
requires the Statistics toolbox, which needs to be purchased separately. There is also a free,
largely MATLAB-compatible alternative called Octave, available at http://www.gnu.org/software/octave/, which
supports most of the functionality of MATLAB. Some (but not all) of the code in this book also 
works in Octave. See the PMTK website for details. 

PMTK was used to generate many of the figures in this book; the source code for these figures 
is included on the PMTK website, allowing the reader to easily see the effects of changing the 
data or algorithm or parameter settings. The book refers to files by name, e.g., naiveBayesFit. 
In order to find the corresponding file, you can use two methods: within Matlab you can type 
which naiveBayesFit and it will return the full path to the file; or, if you do not have Matlab 
but want to read the source code anyway, you can use your favorite search engine, which should 
return the corresponding file from the pmtk3.googlecode.com website. 

Details on how to use PMTK can be found on the website, which will be updated over time.
Details on the underlying theory behind these methods can be found in this book. 
Introduction


1.1 Machine learning: what and why?
We are drowning in information and starving for knowledge. — John Naisbitt. 


We are entering the era of big data. For example, there are about 1 trillion web pages¹; one
hour of video is uploaded to YouTube every second, amounting to 10 years of content every
day²; the genomes of 1000s of people, each of which has a length of $3.8 \times 10^9$ base pairs, have
been sequenced by various labs; Walmart handles more than 1M transactions per hour and has
databases containing more than 2.5 petabytes ($2.5 \times 10^{15}$) of information (Cukier 2010); and so
on.

This deluge of data calls for automated methods of data analysis, which is what machine 
learning provides. In particular, we define machine learning as a set of methods that can 
automatically detect patterns in data, and then use the uncovered patterns to predict future 
data, or to perform other kinds of decision making under uncertainty (such as planning how to 
collect more data!). 

This book adopts the view that the best way to solve such problems is to use the tools
of probability theory. Probability theory can be applied to any problem involving uncertainty. 
In machine learning, uncertainty comes in many forms: what is the best prediction about the 
future given some past data? what is the best model to explain some data? what measurement 
should I perform next? etc. The probabilistic approach to machine learning is closely related to 
the field of statistics, but differs slightly in terms of its emphasis and terminology³.

We will describe a wide variety of probabilistic models, suitable for a wide variety of data and 
tasks. We will also describe a wide variety of algorithms for learning and using such models. 
The goal is not to develop a cookbook of ad hoc techniques, but instead to present a unified
view of the field through the lens of probabilistic modeling and inference. Although we will pay 
attention to computational efficiency, details on how to scale these methods to truly massive 
datasets are better described in other books, such as (Rajaraman and Ullman 2011; Bekkerman 
et al. 201). 


1. http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html

2. Source: http://www.youtube.com/t/press_statistics.

3. Rob Tibshirani, a statistician at Stanford University, has created an amusing comparison between machine learning
and statistics, available at http://www-stat.stanford.edu/~tibs/stat315a/glossary.pdf.
It should be noted, however, that even when one has an apparently massive data set, the
effective number of data points for certain cases of interest might be quite small. In fact, data 
across a variety of domains exhibits a property known as the long tail, which means that a 
few things (e.g., words) are very common, but most things are quite rare (see Section 2.4.6 for 
details). For example, 20% of Google searches each day have never been seen before⁴. This
means that the core statistical issues that we discuss in this book, concerning generalizing from
relatively small sample sizes, are still very relevant even in the big data era.


Types of machine learning 


Machine learning is usually divided into two main types. In the predictive or supervised 
learning approach, the goal is to learn a mapping from inputs x to outputs y, given a labeled 
set of input-output pairs $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$. Here $\mathcal{D}$ is called the training set, and $N$ is the
number of training examples. 

In the simplest setting, each training input $\mathbf{x}_i$ is a $D$-dimensional vector of numbers,
representing, say, the height and weight of a person. These are called features, attributes or
covariates. In general, however, $\mathbf{x}_i$ could be a complex structured object, such as an image, a
sentence, an email message, a time series, a molecular shape, a graph, etc.

Similarly the form of the output or response variable can in principle be anything, but 
most methods assume that $y_i$ is a categorical or nominal variable from some finite set,
$y_i \in \{1, \ldots, C\}$ (such as male or female), or that $y_i$ is a real-valued scalar (such as income
level). When $y_i$ is categorical, the problem is known as classification or pattern recognition,
and when $y_i$ is real-valued, the problem is known as regression. Another variant, known as
ordinal regression, occurs where the label space $\mathcal{Y}$ has some natural ordering, such as grades A-F.

The second main type of machine learning is the descriptive or unsupervised learning 
approach. Here we are only given inputs, $\mathcal{D} = \{\mathbf{x}_i\}_{i=1}^N$, and the goal is to find “interesting
patterns” in the data. This is sometimes called knowledge discovery. This is a much less 
well-defined problem, since we are not told what kinds of patterns to look for, and there is no 
obvious error metric to use (unlike supervised learning, where we can compare our prediction 
of y for a given x to the observed value). 

There is a third type of machine learning, known as reinforcement learning, which is 
somewhat less commonly used. This is useful for learning how to act or behave when given 
occasional reward or punishment signals. (For example, consider how a baby learns to walk.) 
Unfortunately, RL is beyond the scope of this book, although we do discuss decision theory 
in Section 5.7, which is the basis of RL. See e.g., (Kaelbling et al. 1996; Sutton and Barto 1998; 
Russell and Norvig 2010; Szepesvari 2010; Wiering and van Otterlo 2012) for more information 
on RL. 


4. http://certifiedknowledge.org/blog/are-search-queries-becoming-even-more-unique-statistics-from-google.
[Figure 1.1(b): design matrix with D features (attributes) as columns: Color, Shape, Size (cm). Recoverable rows: Blue, Square, 10; Red, Ellipse, 2.4; Red, Ellipse, 20.7.]


Figure 1.1 Left: Some labeled training examples of colored shapes, along with 3 unlabeled test cases. 
Right: Representing the training data as an $N \times D$ design matrix. Row $i$ represents the feature vector $\mathbf{x}_i$.
The last column is the label, $y_i \in \{0, 1\}$. Based on a figure by Leslie Kaelbling.


Supervised learning 


We begin our investigation of machine learning by discussing supervised learning, which is the 
form of ML most widely used in practice. 


Classification 


In this section, we discuss classification. Here the goal is to learn a mapping from inputs x 
to outputs y, where $y \in \{1, \ldots, C\}$, with C being the number of classes. If C = 2, this is
called binary classification (in which case we often assume $y \in \{0, 1\}$); if C > 2, this is called
multiclass classification. If the class labels are not mutually exclusive (e.g., somebody may be 
classified as tall and strong), we call it multi-label classification, but this is best viewed as 
predicting multiple related binary class labels (a so-called multiple output model). When we 
use the term “classification”, we will mean multiclass classification with a single output, unless 
we state otherwise. 

One way to formalize the problem is as function approximation. We assume y = f(x) for 
some unknown function f, and the goal of learning is to estimate the function f given a labeled 
training set, and then to make predictions using $\hat{y} = \hat{f}(\mathbf{x})$. (We use the hat symbol to denote
an estimate.) Our main goal is to make predictions on novel inputs, meaning ones that we have 
not seen before (this is called generalization), since predicting the response on the training set 
is easy (we can just look up the answer). 


Example 


As a simple toy example of classification, consider the problem illustrated in Figure 1.1(a). We 
have two classes of object which correspond to labels 0 and 1. The inputs are colored shapes. 
These have been described by a set of D features or attributes, which are stored in an N x D 
design matrix X, shown in Figure 1.1(b). The input features x can be discrete, continuous or a 
combination of the two. In addition to the inputs, we have a vector of training labels y. 
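As a rough illustration of a design matrix that mixes discrete and continuous features, the sketch below encodes the three rows recoverable from Figure 1.1(b). The one-hot encoding of the discrete features and the helper name `one_hot` are assumptions made for this example, not something the text prescribes, and the label column is left out.

```python
# Rows copied from Figure 1.1(b): (color, shape, size in cm). Labels omitted.
rows = [
    ("Blue", "Square", 10.0),
    ("Red", "Ellipse", 2.4),
    ("Red", "Ellipse", 20.7),
]


def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


# Vocabularies are built from the data in sorted order (an assumption).
colors = sorted({r[0] for r in rows})   # ['Blue', 'Red']
shapes = sorted({r[1] for r in rows})   # ['Ellipse', 'Square']

# Each row of X is one feature vector x_i: discrete features are one-hot
# encoded, and the continuous size feature is kept as a number.
X = [one_hot(c, colors) + one_hot(s, shapes) + [size] for c, s, size in rows]

N, D = len(X), len(X[0])
print(N, D)   # 3 5
print(X[0])   # [1.0, 0.0, 0.0, 1.0, 10.0]
```

With this encoding the three original attributes become D = 5 numeric columns; other encodings are equally possible.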

In Figure 1.1, the test cases are a blue crescent, a yellow circle and a blue arrow. None of 
these have been seen before. Thus we are required to generalize beyond the training set. A
reasonable guess is that the blue crescent should be y = 1, since all blue shapes are labeled 1 in the
training set. The yellow circle is harder to classify, since some yellow things are labeled y = 1 
and some are labeled y = 0, and some circles are labeled y = 1 and some y = 0. Consequently 
it is not clear what the right label should be in the case of the yellow circle. Similarly, the correct 
label for the blue arrow is unclear. 


The need for probabilistic predictions 


To handle ambiguous cases, such as the yellow circle above, it is desirable to return a probability. 
The reader is assumed to already have some familiarity with basic concepts in probability. If
not, please consult Chapter 2 for a refresher.

We will denote the probability distribution over possible labels, given the input vector x and 
training set D by p(y|x,D). In general, this represents a vector of length C. (If there are just two 
classes, it is sufficient to return the single number p(y = 1|x, D), since p(y = 1|x,D) + p(y = 
0|x,D) = 1.) In our notation, we make explicit that the probability is conditional on the test 
input x, as well as the training set D, by putting these terms on the right hand side of the 
conditioning bar |. We are also implicitly conditioning on the form of model that we use to make 
predictions. When choosing between different models, we will make this assumption explicit by 
writing p(y|x,D, M), where M denotes the model. However, if the model is clear from context, 
we will drop M from our notation for brevity. 

Given a probabilistic output, we can always compute our “best guess” as to the “true label” 
using 


$$\hat{y} = \hat{f}(\mathbf{x}) = \operatorname*{argmax}_{c=1}^{C} \; p(y = c \mid \mathbf{x}, \mathcal{D}) \tag{1.1}$$

This corresponds to the most probable class label, and is called the mode of the distribution 
p(y|x, D); it is also known as a MAP estimate (MAP stands for maximum a posteriori). Using 
the most probable label makes intuitive sense, but we will give a more formal justification for 
this procedure in Section 5.7. 
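The mode computation in Equation 1.1 can be sketched in a few lines. The function name `map_estimate` and the probabilities below are hypothetical; in practice the distribution would come from a fitted model.

```python
def map_estimate(probs):
    """Return (label, probability) of the most probable class.

    `probs` maps each class label c to p(y = c | x, D).
    """
    total = sum(probs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("class probabilities must sum to 1")
    label = max(probs, key=probs.get)  # argmax over class labels
    return label, probs[label]


# Hypothetical predictive distribution over C = 3 classes for one input x.
p_y_given_x = {1: 0.2, 2: 0.7, 3: 0.1}
y_hat, p_hat = map_estimate(p_y_given_x)
print(y_hat, p_hat)   # 2 0.7
```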

Now consider a case such as the yellow circle, where $p(\hat{y} \mid \mathbf{x}, \mathcal{D})$ is far from 1.0. In such a
case we are not very confident of our answer, so it might be better to say “I don’t know” instead 
of returning an answer that we don’t really trust. This is particularly important in domains 
such as medicine and finance where we may be risk averse, as we explain in Section 5.7. 
Another application where it is important to assess risk is when playing TV game shows, such 
as Jeopardy. In this game, contestants have to solve various word puzzles and answer a variety 
of trivia questions, but if they answer incorrectly, they lose money. In 2011, IBM unveiled a 
computer system called Watson which beat the top human Jeopardy champion. Watson uses a 
variety of interesting techniques (Ferrucci et al. 2010), but the most pertinent one for our present 
purposes is that it contains a module that estimates how confident it is of its answer. The system 
only chooses to “buzz in” its answer if sufficiently confident it is correct. Similarly, Google has a 
system known as SmartASS (ad selection system) that predicts the probability you will click on 
an ad based on your search history and other user and ad-specific features (Metz 2010). This 
probability is known as the click-through rate or CTR, and can be used to maximize expected 
profit. We will discuss some of the basic principles behind systems such as SmartASS later in
this book.
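The idea of saying "I don't know" when the most probable label is not probable enough can be sketched as follows. The function name `predict_or_abstain` and the threshold of 0.8 are assumptions chosen for illustration; how to set the threshold in a principled way is a decision-theoretic question.

```python
def predict_or_abstain(probs, threshold=0.8):
    """Return the most probable label, or None if its probability is too low.

    `probs` maps each class label c to p(y = c | x, D); `threshold` is a
    hypothetical confidence level below which we decline to answer.
    """
    label = max(probs, key=probs.get)
    if probs[label] < threshold:
        return None  # "I don't know"
    return label


confident = {0: 0.05, 1: 0.95}   # hypothetical clear-cut case
ambiguous = {0: 0.45, 1: 0.55}   # hypothetical ambiguous case
print(predict_or_abstain(confident))   # 1
print(predict_or_abstain(ambiguous))   # None
```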


Figure 1.2 Subset of size 16242 x 100 of the 20-newsgroups data. We only show 1000 rows, for clarity. 
Each row is a document (represented as a bag-of-words bit vector), each column is a word. The red 
lines separate the 4 classes, which are (in descending order) comp, rec, sci, talk (these are the titles of 
USENET groups). We can see that there are subsets of words whose presence or absence is indicative 
of the class. The data is available from http://cs.nyu.edu/~roweis/data.html. Figure generated by 
newsgroupsVisualize. 


Real-world applications 


Classification is probably the most widely used form of machine learning, and has been used 
to solve many interesting and often difficult real-world problems. We have already mentioned 
some important applications. We give a few more examples below.


Document classification and email spam filtering 


In document classification, the goal is to classify a document, such as a web page or email 
message, into one of C classes, that is, to compute p(y = c|x, D), where x is some representation
of the text. A special case of this is email spam filtering, where the classes are spam
y = 1 or ham y = 0.

Most classifiers assume that the input vector x has a fixed size. A common way to represent 
variable-length documents in feature-vector format is to use a bag of words representation. 
This is explained in detail in Section 3.4.4.1, but the basic idea is to define $x_{ij} = 1$ iff word j
occurs in document i. If we apply this transformation to every document in our data set, we get
a binary document × word co-occurrence matrix: see Figure 1.2 for an example. Essentially the
document classification problem has been reduced to one that looks for subtle changes in the 
pattern of bits. For example, we may notice that most spam messages have a high probability of 
containing the words “buy”, “cheap”, “viagra”, etc. In Exercise 8.1 and Exercise 8.2, you will get 
hands-on experience applying various classification techniques to the spam filtering problem. 
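A minimal sketch of the bag-of-words construction described above, assuming a simple lowercase letters-only tokenizer. The three documents and the `tokenize` helper are made up for illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


# Hypothetical documents.
docs = [
    "Buy cheap watches, buy now",
    "Meeting moved to noon",
    "Cheap flights to Rome",
]

# The vocabulary fixes the column order of the matrix.
vocab = sorted({w for d in docs for w in tokenize(d)})

# X[i][j] = 1 iff word j occurs in document i (counts are ignored).
X = []
for d in docs:
    present = set(tokenize(d))
    X.append([1 if w in present else 0 for w in vocab])

for row in X:
    print(row)
```

Every row has the same length regardless of document length, which is what lets a fixed-size classifier consume variable-length text.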
Figure 1.3 Three types of iris flowers: setosa, versicolor and virginica. Source: http://www.statlab.uni-heidelberg.de/data/iris/.
Used with kind permission of Dennis Kramb and SIGNA.


[Figure 1.4: 4 × 4 grid of plots; rows and columns are labeled sepal length, sepal width, petal length, petal width.]


Figure 1.4 Visualization of the Iris data as a pairwise scatter plot. The diagonal plots the marginal 
histograms of the 4 features. The off diagonals contain scatterplots of all possible pairs of features. Red 
circle = setosa, green diamond = versicolor, blue star = virginica. Figure generated by fisheririsDemo. 


Classifying flowers 


Figure 1.3 gives another example of classification, due to the statistician Ronald Fisher. The goal 
is to learn to distinguish three different kinds of iris flower, called setosa, versicolor and virginica. 
Fortunately, rather than working directly with images, a botanist has already extracted 4 useful 
features or characteristics: sepal length and width, and petal length and width. (Such feature 
extraction is an important, but difficult, task. Most machine learning methods use features 
chosen by some human. Later we will discuss some methods that can learn good features from 
the data.) If we make a scatter plot of the iris data, as in Figure 1.4, we see that it is easy to 
distinguish setosas (red circles) from the other two classes by just checking if their petal length
or width is below some threshold. However, distinguishing versicolor from virginica is slightly
harder; any decision will need to be based on at least two features. (It is always a good idea
to perform exploratory data analysis, such as plotting the data, before applying a machine
learning method.)


[Figure 1.5: panels (a) and (b), each showing nine digit images titled with their true classes: 7, 2, 1, 0, 4, 1, 4, 9, 5.]

Figure 1.5 (a) First 9 test MNIST gray-scale images. (b) Same as (a), but with the features permuted
randomly. Classification performance is identical on both versions of the data (assuming the training data
is permuted in an identical way). Figure generated by shuffledDigitsDemo.


Image classification and handwriting recognition 


Now consider the harder problem of classifying images directly, where a human has not
preprocessed the data. We might want to classify the image as a whole, e.g., is it an indoor or
outdoor scene? Is it a horizontal or vertical photo? Does it contain a dog or not? This is called
image classification.

In the special case that the images consist of isolated handwritten letters and digits, for 
example, in a postal or ZIP code on a letter, we can use classification to perform handwriting 
recognition. A standard dataset used in this area is known as MNIST, which stands for “Modified
National Institute of Standards”⁵. (The term “modified” is used because the images have been
preprocessed to ensure the digits are mostly in the center of the image.) This dataset contains
60,000 training images and 10,000 test images of the digits 0 to 9, as written by various people.
The images are size 28 × 28 and have grayscale values in the range 0 : 255. See Figure 1.5(a) for
some example images. 

Many generic classification methods ignore any structure in the input features, such as spatial 
layout. Consequently, they can also just as easily handle data that looks like Figure 1.5(b), which 
is the same data except we have randomly permuted the order of all the features. (You will 
verify this in Exercise 1.1.) This flexibility is both a blessing (since the methods are general 
purpose) and a curse (since the methods ignore an obviously useful source of information). We 
will discuss methods for exploiting structure in the input features later in the book. 
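The claim that a generic classifier is unaffected by a fixed permutation of the features can be checked on a small scale. The sketch below uses a 1-nearest-neighbor rule with squared Euclidean distance as one example of a structure-blind method; this choice, the random integer "images", and the helper names are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nn_predict(train_X, train_y, x):
    """Label of the training point closest to x (1-nearest-neighbor rule)."""
    best = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))
    return train_y[best]


def permute(x, order):
    """Reorder the features of x according to `order`."""
    return [x[j] for j in order]


rng = random.Random(0)
# Hypothetical data: 20 training "images" and 5 test "images" of 16 pixels.
train_X = [[rng.randint(0, 255) for _ in range(16)] for _ in range(20)]
train_y = [rng.randint(0, 9) for _ in range(20)]
test_X = [[rng.randint(0, 255) for _ in range(16)] for _ in range(5)]

# One fixed random permutation, applied to training and test data alike.
order = list(range(16))
rng.shuffle(order)
perm_train_X = [permute(x, order) for x in train_X]

before = [nn_predict(train_X, train_y, x) for x in test_X]
after = [nn_predict(perm_train_X, train_y, permute(x, order)) for x in test_X]
print(before == after)   # True
```

The distance is a sum over features, so reordering the terms leaves every distance, and hence every prediction, unchanged.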


5. Available from http://yann.lecun.com/exdb/mnist/.
Figure 1.6 Example of face detection. (a) Input image (Murphy family, photo taken 5 August 2010). Used
with kind permission of Bernard Diedrich of Sherwood Studios. (b) Output of classifier, which detected 5
faces at different poses. This was produced using the online demo at http://demo.pittpatt.com/. The
classifier was trained on 1000s of manually labeled images of faces and non-faces, and then was applied 
to a dense set of overlapping patches in the test image. Only the patches whose probability of containing 
a face was sufficiently high were returned. Used with kind permission of Pittpatt.com 


Face detection and recognition 


A harder problem is to find objects within an image; this is called object detection or object 
localization. An important special case of this is face detection. One approach to this problem 
is to divide the image into many small overlapping patches at different locations, scales and 
orientations, and to classify each such patch based on whether it contains face-like texture or 
not. This is called a sliding window detector. The system then returns those locations where 
the probability of face is sufficiently high. See Figure 1.6 for an example. Such face detection 
systems are built into most modern digital cameras; the locations of the detected faces are
used to determine the center of the auto-focus. Another application is automatically blurring 
out faces in Google's StreetView system. 
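The sliding window idea can be sketched as a loop over patch locations, scoring each patch and keeping those above a threshold. This simplified version scans a single scale and orientation only, and the scoring function `mean_brightness` is a stand-in (an assumption) for a trained face/non-face classifier.

```python
def sliding_windows(height, width, size, stride):
    """Yield (top, left) corners of all size x size patches inside the image."""
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            yield top, left


def detect(image, score_patch, size=2, stride=1, threshold=0.9):
    """Return (top, left, score) for every patch scoring at least `threshold`."""
    h, w = len(image), len(image[0])
    hits = []
    for top, left in sliding_windows(h, w, size, stride):
        patch = [row[left:left + size] for row in image[top:top + size]]
        p = score_patch(patch)
        if p >= threshold:
            hits.append((top, left, p))
    return hits


def mean_brightness(patch):
    """Placeholder scorer: average pixel value, standing in for p(face | patch)."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)


# Hypothetical 4 x 4 image with a bright 2 x 2 block in the middle.
image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(detect(image, mean_brightness))   # [(1, 1, 1.0)]
```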

Having found the faces, one can then proceed to perform face recognition, which means 
estimating the identity of the person (see Figure 1.10(a)). In this case, the number of class labels 
might be very large. Also, the features one should use are likely to be different than in the face 
detection problem: for recognition, subtle differences between faces such as hairstyle may be 
important for determining identity, but for detection, it is important to be invariant to such 
details, and to just focus on the differences between faces and non-faces. For more information 
about visual object detection, see e.g., (Szeliski 2010). 


Regression 


Regression is just like classification except the response variable is continuous. Figure 1.7 shows 
a simple example: we have a single real-valued input $x_i \in \mathbb{R}$, and a single real-valued response
$y_i \in \mathbb{R}$. We consider fitting two models to the data: a straight line and a quadratic function.
(We explain how to fit such models below.) Various extensions of this basic problem can arise, 
such as having high-dimensional inputs, outliers, non-smooth responses, etc. We will discuss 
ways to handle such problems later in the book. 
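As a rough sketch of fitting a straight line and a quadratic to 1d data, the code below assumes ordinary least squares as the fitting criterion and solves the normal equations directly. The data are hypothetical and noise-free, and the helper names (`solve`, `polyfit`, `sse`) are made up for this example.

```python
def solve(A, b):
    """Solve the linear system A w = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def polyfit(xs, ys, degree):
    """Least-squares coefficients [w0, w1, ...] of y ~ w0 + w1 x + w2 x^2 + ..."""
    Phi = [[x ** k for k in range(degree + 1)] for x in xs]  # polynomial features
    m = degree + 1
    A = [[sum(p[i] * p[j] for p in Phi) for j in range(m)] for i in range(m)]
    b = [sum(p[i] * y for p, y in zip(Phi, ys)) for i in range(m)]
    return solve(A, b)  # normal equations: (Phi^T Phi) w = Phi^T y


def sse(w, xs, ys):
    """Sum of squared errors of the polynomial with coefficients w."""
    return sum((y - sum(wk * x ** k for k, wk in enumerate(w))) ** 2
               for x, y in zip(xs, ys))


# Hypothetical data generated from y = 1 + x^2 without noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.0, 5.0, 10.0, 17.0]

line = polyfit(xs, ys, 1)
quad = polyfit(xs, ys, 2)
print(sse(line, xs, ys) > sse(quad, xs, ys))   # True
```

On these data the quadratic recovers the generating curve, while the straight line leaves a visible residual.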
Figure 1.7 (a) Linear regression on some 1d data (panel title: degree 1). (b) Same data with polynomial regression (degree 2).
Figure generated by linregPolyVsDegree.


Here are some examples of real-world regression problems. 


• Predict tomorrow's stock market price given current market conditions and other possible
  side information.

• Predict the age of a viewer watching a given video on YouTube.

• Predict the location in 3d space of a robot arm end effector, given control signals (torques)
  sent to its various motors.

• Predict the amount of prostate specific antigen (PSA) in the body as a function of a number
  of different clinical measurements.

• Predict the temperature at any location inside a building using weather data, time, door
  sensors, etc.


Unsupervised learning 


We now consider unsupervised learning, where we are just given output data, without any 
inputs. The goal is to discover “interesting structure” in the data; this is sometimes called 
knowledge discovery. Unlike supervised learning, we are not told what the desired output is 
for each input. Instead, we will formalize our task as one of density estimation, that is, we 
want to build models of the form $p(\mathbf{x}_i \mid \boldsymbol{\theta})$. There are two differences from the supervised case.
First, we have written $p(\mathbf{x}_i \mid \boldsymbol{\theta})$ instead of $p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta})$; that is, supervised learning is conditional
density estimation, whereas unsupervised learning is unconditional density estimation. Second,
$\mathbf{x}_i$ is a vector of features, so we need to create multivariate probability models. By contrast,
in supervised learning, $y_i$ is usually just a single variable that we are trying to predict. This
means that for most supervised learning problems, we can use univariate probability models 
(with input-dependent parameters), which significantly simplifies the problem. (We will discuss 
multi-output classification in Chapter 19, where we will see that it also involves multivariate 
probability models.) 

Figure 1.8 (a) The height and weight of some people. (b) A possible clustering using K = 2 clusters.
Figure generated by kmeansHeightWeight.


Unsupervised learning is arguably more typical of human and animal learning. It is also
more widely applicable than supervised learning, since it does not require a human expert to
manually label the data. Labeled data is not only expensive to acquire⁶, but it also contains
relatively little information, certainly not enough to reliably estimate the parameters of complex 
models. Geoff Hinton, who is a famous professor of ML at the University of Toronto, has said: 


When we're learning to see, nobody's telling us what the right answers are — we just 
look. Every so often, your mother says “that’s a dog”, but that’s very little information. 
You'd be lucky if you got a few bits of information — even one bit per second — that 
way. The brain’s visual system has $10^{14}$ neural connections. And you only live for $10^9$
seconds. So it’s no use learning one bit per second. You need more like $10^5$ bits per
second. And there’s only one place you can get that much information: from the input 
itself. — Geoffrey Hinton, 1996 (quoted in (Gorder 2006)). 


Below we describe some canonical examples of unsupervised learning. 


Discovering clusters 


As a canonical example of unsupervised learning, consider the problem of clustering data into 
groups. For example, Figure 1.8(a) plots some 2d data, representing the height and weight of 
a group of 210 people. It seems that there might be various clusters, or subgroups, although 
it is not clear how many. Let K denote the number of clusters. Our first goal is to estimate
the distribution over the number of clusters, p(K|D); this tells us if there are subpopulations
within the data. For simplicity, we often approximate the distribution p(K|D) by its mode,
$K^* = \arg\max_K p(K \mid \mathcal{D})$. In the supervised case, we were told that there are two classes (male
and female), but in the unsupervised case, we are free to choose as many or few clusters as we 
like. Picking a model of the “right” complexity is called model selection, and will be discussed 
in detail below. 

6. The advent of crowd sourcing web sites such as Mechanical Turk (https://www.mturk.com/mturk/welcome),
which outsource data processing tasks to humans all over the world, has reduced the cost of labeling data. Nevertheless,
the amount of unlabeled data is still orders of magnitude larger than the amount of labeled data.

Figure 1.9 (a) A set of points that live on a 2d linear subspace embedded in 3d. The solid red line is the
first principal component direction. The dotted black line is the second PC direction. (b) 2D representation
of the data. Figure generated by pcaDemo3d.


Our second goal is to estimate which cluster each point belongs to. Let $z_i \in \{1, \ldots, K\}$
represent the cluster to which data point $i$ is assigned. ($z_i$ is an example of a hidden or
latent variable, since it is never observed in the training set.) We can infer which cluster each
data point belongs to by computing $z_i^* = \operatorname{argmax}_k p(z_i = k \mid \mathbf{x}_i, \mathcal{D})$. This is illustrated in
Figure 1.8(b), where we use different colors to indicate the assignments, assuming K = 2.
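To make the assignment rule concrete, the sketch below computes p(z = k | x) under a 1d mixture of two Gaussians and picks the most probable cluster. The mixture parameters are hypothetical and treated as already known; estimating them from the data is a separate step not shown here. Clusters are indexed from 0 in the code rather than from 1.

```python
import math


def gauss_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(x, weights, means, stds):
    """Return [p(z = k | x)] for k = 0..K-1 using Bayes' rule."""
    joint = [w * gauss_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


def assign(x, weights, means, stds):
    """Index of the most probable cluster for x."""
    r = responsibilities(x, weights, means, stds)
    return max(range(len(r)), key=r.__getitem__)


# Hypothetical parameters for K = 2 clusters of heights (cm).
weights, means, stds = [0.5, 0.5], [162.0, 176.0], [6.0, 7.0]
heights = [155.0, 168.0, 185.0]
assignments = [assign(h, weights, means, stds) for h in heights]
print(assignments)
```

Keeping the full vector of responsibilities, rather than only the argmax, preserves the uncertainty for points that fall between the two clusters.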

In this book, we focus on model based clustering, which means we fit a probabilistic model 
to the data, rather than running some ad hoc algorithm. The advantages of the model-based 
approach are that one can compare different kinds of models in an objective way (in terms of 
the likelihood they assign to the data), we can combine them together into larger systems, etc. 

Here are some real world applications of clustering. 


• In astronomy, the autoclass system (Cheeseman et al. 1988) discovered a new type of star,
  based on clustering astrophysical measurements.

• In e-commerce, it is common to cluster users into groups, based on their purchasing or
  web-surfing behavior, and then to send customized targeted advertising to each group (see
  e.g., (Berkhin 2006)).

• In biology, it is common to cluster flow-cytometry data into groups, to discover different
  sub-populations of cells (see e.g., (Lo et al. 2009)).


Discovering latent factors 


When dealing with high dimensional data, it is often useful to reduce the dimensionality by 
projecting the data to a lower dimensional subspace which captures the “essence” of the data. 
This is called dimensionality reduction. A simple example is shown in Figure 1.9, where we 
project some 3d data down to a 2d plane. The 2d approximation is quite good, since most points 
lie close to this subspace. Reducing to 1d would involve projecting points onto the red line in
Figure 1.9(a); this would be a rather poor approximation. (We will make this notion precise in 
Chapter 12.) 

The motivation behind this technique is that although the data may appear high dimensional, 
there may only be a small number of degrees of variability, corresponding to latent factors. For 
example, when modeling the appearance of face images, there may only be a few underlying 
latent factors which describe most of the variability, such as lighting, pose, identity, etc, as 
illustrated in Figure 1.10. 
Figure 1.10 (a) 25 randomly chosen 64 x 64 pixel images from the Olivetti face database. (b) The mean
and the first three principal component basis vectors (eigenfaces). Figure generated by pcaImageDemo.


When used as input to other statistical models, such low dimensional representations often 
result in better predictive accuracy, because they focus on the “essence” of the object, filtering 
out inessential features. Also, low dimensional representations are useful for enabling fast 
nearest neighbor searches and two dimensional projections are very useful for visualizing high 
dimensional data. 

The most common approach to dimensionality reduction is called principal components 
analysis or PCA. This can be thought of as an unsupervised version of (multi-output) linear 
regression, where we observe the high-dimensional response y, but not the low-dimensional 
“cause” z. Thus the model has the form z → y; we have to “invert the arrow”, and infer the
latent low-dimensional z from the observed high-dimensional y. See Section 12.1 for details. 
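
As a rough illustration of this idea (not the pcaDemo3d code mentioned in the figure caption), the following NumPy sketch generates synthetic 3d points lying close to a 2d plane, and recovers a low-dimensional representation from the singular value decomposition of the centered data. The synthetic data, the noise level, and the names pca_project, Y, Z and W are assumptions made for illustration only.

```python
import numpy as np

def pca_project(Y, L):
    """Project the rows of Y (N x D) onto the top-L principal directions.

    Returns the low-dimensional scores Z (N x L), the directions W (D x L),
    and the data mean.
    """
    mean = Y.mean(axis=0)
    Yc = Y - mean                        # center the data
    # Rows of Vt are the principal directions, sorted by singular value.
    U, S, Vt = np.linalg.svd(Yc, full_matrices=False)
    W = Vt[:L].T
    Z = Yc @ W                           # latent low-dimensional representation
    return Z, W, mean

rng = np.random.default_rng(0)
# Synthetic data: 200 points on a 2d plane embedded in 3d, plus a little noise.
latent = rng.normal(size=(200, 2)) * np.array([3.0, 1.0])
basis = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.5]])
Y = latent @ basis + 0.05 * rng.normal(size=(200, 3))

for L in (1, 2):
    Z, W, mean = pca_project(Y, L)
    Y_hat = Z @ W.T + mean               # reconstruction from L dimensions
    err = np.mean(np.sum((Y - Y_hat) ** 2, axis=1))
    print(f"L={L}: mean squared reconstruction error = {err:.4f}")
```

On data like this, the 2d reconstruction error should be close to the noise level, while the 1d error is much larger, which mirrors the comparison made for Figure 1.9.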

Dimensionality reduction, and PCA in particular, has been applied in many different areas. 
Some examples include the following: 


• In biology, it is common to use PCA to interpret gene microarray data, to account for the
fact that each measurement is usually the result of many genes which are correlated in their
behavior by the fact that they belong to different biological pathways.


• In natural language processing, it is common to use a variant of PCA called latent semantic
analysis for document retrieval (see Section 27.2.2).


• In signal processing (e.g., of acoustic or neural signals), it is common to use ICA (which is a
variant of PCA) to separate signals into their different sources (see Section 12.6).


• In computer graphics, it is common to project motion capture data to a low dimensional
space, and use it to create animations. See Section 15.5 for one way to tackle such problems.
Figure 1.11 A sparse undirected Gaussian graphical model learned using graphical lasso (Section 26.7.2)
applied to some flow cytometry data (from (Sachs et al. 2005)), which measures the phosphorylation status
of 11 proteins. Figure generated by ggmLassoDemo.


Discovering graph structure 


Sometimes we measure a set of correlated variables, and we would like to discover which ones 
are most correlated with which others. This can be represented by a graph G, in which nodes 
represent variables, and edges represent direct dependence between variables (we will make 
this precise in Chapter 10, when we discuss graphical models). We can then learn this graph 
structure from data, i.e., we compute $\hat{G} = \operatorname{argmax}_G p(G \mid \mathcal{D})$.

As with unsupervised learning in general, there are two main applications for learning sparse 
graphs: to discover new knowledge, and to get better joint probability density estimators. We 
now give some examples of each.


• Much of the motivation for learning sparse graphical models comes from the systems biology
community. For example, suppose we measure the phosphorylation status of some proteins 
in a cell (Sachs et al. 2005). Figure 1.11 gives an example of a graph structure that was learned 
from this data (using methods discussed in Section 26.7.2). As another example, Smith et al. 
(2006) showed that one can recover the neural “wiring diagram” of a certain kind of bird 
from time-series EEG data. The recovered structure closely matched the known functional 
connectivity of this part of the bird brain. 


• In some cases, we are not interested in interpreting the graph structure; we just want to
use it to model correlations and to make predictions. One example of this is in financial
portfolio management, where accurate models of the covariance between large numbers of
different stocks are important. Carvalho and West (2007) show that by learning a sparse graph,
and then using this as the basis of a trading strategy, it is possible to outperform (i.e., make 
more money than) methods that do not exploit sparse graphs. Another example is predicting 
traffic jams on the freeway. Horvitz et al. (2005) describe a deployed system called JamBayes 
for predicting traffic flow in the Seattle area; predictions are made using a graphical model 
whose structure was learned from data. 
Figure 1.12 (a) A noisy image with an occluder. (b) An estimate of the underlying pixel intensities, based
on a pairwise MRF model. Source: Figure 8 of (Felzenszwalb and Huttenlocher 2006). Used with kind 
permission of Pedro Felzenszwalb. 


Matrix completion 


Sometimes we have missing data, that is, variables whose values are unknown. For example, we 
might have conducted a survey, and some people might not have answered certain questions. 
Or we might have various sensors, some of which fail. The corresponding design matrix will 
then have “holes” in it; these missing entries are often represented by NaN, which stands for 
“not a number”. The goal of imputation is to infer plausible values for the missing entries. This 
is sometimes called matrix completion. Below we give some example applications. 


Image inpainting 


An interesting example of an imputation-like task is known as image inpainting. The goal is 
to “fill in” holes (e.g., due to scratches or occlusions) in an image with realistic texture. This is 
illustrated in Figure 1.12, where we denoise the image, as well as impute the pixels hidden behind 
the occlusion. This can be tackled by building a joint probability model of the pixels, given a 
set of clean images, and then inferring the unknown variables (pixels) given the known variables 
(pixels). This is somewhat like market basket analysis, except the data is real-valued and spatially
structured, so the kinds of probability models we use are quite different. See Sections 19.6.2.7 
and 13.8.4 for some possible choices. 


Collaborative filtering 


Another interesting example of an imputation-like task is known as collaborative filtering. A 
common example of this concerns predicting which movies people will want to watch based 
on how they, and other people, have rated movies which they have already seen. The key idea 
is that the prediction is not based on features of the movie or user (although it could be), but 
merely on a ratings matrix. More precisely, we have a matrix X where X(m,u) is the rating 
Figure 1.13 Example of movie-rating data. Training data is in red, test data is denoted by ?, empty cells
are unknown. 


(say an integer between 1 and 5, where 1 is dislike and 5 is like) by user u of movie m. Note 
that most of the entries in X will be missing or unknown, since most users will not have rated 
most movies. Hence we only observe a tiny subset of the X matrix, and we want to predict 
a different subset. In particular, for any given user u, we might want to predict which of the 
unrated movies he/she is most likely to want to watch. 

In order to encourage research in this area, the DVD rental company Netflix created a competition,
launched in 2006, with a $1M USD prize (see http://netflixprize.com/). In
particular, they provided a large matrix of ratings, on a scale of 1 to 5, for ~ 18k movies
created by ~ 500k users. The full matrix would have ~ 9 × 10^9 entries, but only about 1%
of the entries are observed, so the matrix is extremely sparse. A subset of these are used for 
training, and the rest for testing, as shown in Figure 1.13. The goal of the competition was to 
predict more accurately than Netflix’s existing system. On 21 September 2009, the prize was 
awarded to a team of researchers known as “BellKor’s Pragmatic Chaos”. Section 27.6.2 discusses 
some of their methodology. Further details on the teams and their methods can be found at 
http://www.netflixprize.com/community/viewtopic.php?id=1537. 


Market basket analysis 


In commercial data mining, there is much interest in a task called market basket analysis. The 
data consists of a (typically very large but sparse) binary matrix, where each column represents 
an item or product, and each row represents a transaction. We set $x_{ij} = 1$ if item $j$ was
purchased on the $i$'th transaction. Many items are purchased together (e.g., bread and butter),
so there will be correlations amongst the bits. Given a new partially observed bit vector, 
representing a subset of items that the consumer has bought, the goal is to predict which other 
bits are likely to turn on, representing other items the consumer might be likely to buy. (Unlike 
collaborative filtering, we often assume there is no missing data in the training data, since we 
know the past shopping behavior of each customer.) 

This task arises in other domains besides modeling purchasing patterns. For example, similar 
techniques can be used to model dependencies between files in complex software systems. In 
this case, the task is to predict, given a subset of files that have been changed, which other ones 
need to be updated to ensure consistency (see e.g., (Hu et al. 2010)). 

It is common to solve such tasks using frequent itemset mining, which creates association
rules (see e.g., (Hastie et al. 2009, sec 14.2) for details). Alternatively, we can adopt a probabilistic
approach, and fit a joint density model $p(x_1, \ldots, x_D)$ to the bit vectors, see e.g., (Hu et al.
Figure 1.14 (a) Illustration of a K-nearest neighbors classifier in 2d for K = 3. The 3 nearest neighbors
of test point $\mathbf{x}_1$ have labels 1, 1 and 0, so we predict $p(y = 1 \mid \mathbf{x}_1, \mathcal{D}, K = 3) = 2/3$. The 3 nearest
neighbors of test point $\mathbf{x}_2$ have labels 0, 0, and 0, so we predict $p(y = 1 \mid \mathbf{x}_2, \mathcal{D}, K = 3) = 0/3$. (b)
Illustration of the Voronoi tessellation induced by 1-NN. Based on Figure 4.13 of (Duda et al. 2001). Figure
generated by knnVoronoi.


2010). Such models often have better predictive accuracy than association rules, although they
may be less interpretable. This is typical of the difference between data mining and machine
learning: in data mining, there is more emphasis on interpretable models, whereas in machine 
learning, there is more emphasis on accurate models. 


Some basic concepts in machine learning 


In this section, we provide an introduction to some key ideas in machine learning. We will
expand on these concepts later in the book, but we introduce them briefly here, to give a flavor 
of things to come. 


Parametric vs non-parametric models 


In this book, we will be focussing on probabilistic models of the form p(y|x) or p(x), depending 
on whether we are interested in supervised or unsupervised learning respectively. There are 
many ways to define such models, but the most important distinction is this: does the model 
have a fixed number of parameters, or does the number of parameters grow with the amount 
of training data? The former is called a parametric model, and the latter is called a non- 
parametric model. Parametric models have the advantage of often being faster to use, but the 
disadvantage of making stronger assumptions about the nature of the data distributions. Non- 
parametric models are more flexible, but often computationally intractable for large datasets. 
We will give examples of both kinds of models in the sections below. We focus on supervised 
learning for simplicity, although much of our discussion also applies to unsupervised learning. 


A simple non-parametric classifier: K-nearest neighbors


A simple example of a non-parametric classifier is the K nearest neighbor (KNN) classifier. 
This simply “looks at” the K points in the training set that are nearest to the test input x, 
Figure 1.15 (a) Some synthetic 3-class training data in 2d. (b) Probability of class 1 for KNN with K = 10.
(c) Probability of class 2. (d) MAP estimate of class label. Figure generated by knnClassifyDemo. 


counts how many members of each class are in this set, and returns that empirical fraction as 
the estimate, as illustrated in Figure 1.14. More formally, 


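$$p(y = c \mid \mathbf{x}, \mathcal{D}, K) = \frac{1}{K} \sum_{i \in N_K(\mathbf{x}, \mathcal{D})} \mathbb{I}(y_i = c) \qquad (1.2)$$


where $N_K(\mathbf{x}, \mathcal{D})$ are the (indices of the) $K$ nearest points to $\mathbf{x}$ in $\mathcal{D}$ and $\mathbb{I}(e)$ is the indicator
function defined as follows:


$$\mathbb{I}(e) = \begin{cases} 1 & \text{if } e \text{ is true} \\ 0 & \text{if } e \text{ is false} \end{cases} \qquad (1.3)$$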


This method is an example of memory-based learning or instance-based learning. It can 
be derived from a probabilistic framework as explained in Section 14.7.3. The most common 
Figure 1.16 Illustration of the curse of dimensionality. (a) We embed a small cube of side s inside a larger
unit cube. (b) We plot the edge length of a cube needed to cover a given volume of the unit cube as a 
function of the number of dimensions. Based on Figure 2.6 from (Hastie et al. 2009). Figure generated by 
curseDimensionality. 


distance metric to use is Euclidean distance (which limits the applicability of the technique to 
data which is real-valued), although other metrics can be used. 

Figure 1.15 gives an example of the method in action, where the input is two dimensional, we 
have three classes, and K = 10. (We discuss the effect of K below.) Panel (a) plots the training 
data. Panel (b) plots p(y = 1|x, D) where x is evaluated on a grid of points. Panel (c) plots 
p(y = 2|x, D). We do not need to plot p(y = 3|x, D), since probabilities sum to one. Panel (d) 
plots the MAP estimate $\hat{y}(\mathbf{x}) = \operatorname{argmax}_c p(y = c \mid \mathbf{x}, \mathcal{D})$.

A KNN classifier with K = 1 induces a Voronoi tessellation of the points (see Figure 1.14(b)). 
This is a partition of space which associates a region $V(\mathbf{x}_i)$ with each point $\mathbf{x}_i$ in such a way
that all points in $V(\mathbf{x}_i)$ are closer to $\mathbf{x}_i$ than to any other point. Within each cell, the predicted
label is the label of the corresponding training point. 
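
The KNN rule in Equation 1.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the knnClassifyDemo code: it uses Euclidean distance, a brute-force search over the training set, and a made-up toy data set. The function names knn_predict_proba and knn_predict are hypothetical.

```python
import math
from collections import Counter

def knn_predict_proba(x, X_train, y_train, K, classes):
    """Return {c: p(y = c | x, D, K)}, the empirical class fractions of Equation 1.2."""
    # Euclidean distance from x to every training point.
    dists = [math.dist(x, xi) for xi in X_train]
    # Indices of the K nearest training points, N_K(x, D).
    nearest = sorted(range(len(X_train)), key=lambda i: dists[i])[:K]
    counts = Counter(y_train[i] for i in nearest)
    return {c: counts[c] / K for c in classes}

def knn_predict(x, X_train, y_train, K, classes):
    """MAP estimate: the class with the largest empirical fraction."""
    probs = knn_predict_proba(x, X_train, y_train, K, classes)
    return max(classes, key=lambda c: probs[c])

# Toy 2d training set (made up for illustration).
X_train = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
y_train = [0, 0, 1, 1, 1]

# The 3 nearest neighbors of (0.6, 0.6) have labels 1, 1 and 0.
probs = knn_predict_proba((0.6, 0.6), X_train, y_train, K=3, classes=[0, 1])
print(probs)
print(knn_predict((0.1, 0.0), X_train, y_train, K=1, classes=[0, 1]))
```

With K = 1, knn_predict simply returns the label of the closest training point, which is the Voronoi-cell behavior described above.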


The curse of dimensionality 


The KNN classifier is simple and can work quite well, provided it is given a good distance metric 
and has enough labeled training data. In fact, it can be shown that the KNN classifier can come 
within a factor of 2 of the best possible performance if $N \to \infty$ (Cover and Hart 1967).

However, the main problem with KNN classifiers is that they do not work well with high 
dimensional inputs. The poor performance in high dimensional settings is due to the curse of 
dimensionality. 

To explain the curse, we give some examples from (Hastie et al. 2009, p22). Consider applying 
a KNN classifier to data where the inputs are uniformly distributed in the D-dimensional unit 
cube. Suppose we estimate the density of class labels around a test point x by “growing” a 
hyper-cube around x until it contains a desired fraction f of the data points. The expected edge 
length of this cube will be $e_D(f) = f^{1/D}$. If $D = 10$, and we want to base our estimate on 10%
Figure 1.17 (a) A Gaussian pdf with mean 0 and variance 1. Figure generated by gaussPlotDemo. (b)
Visualization of the conditional density model $p(y \mid x, \boldsymbol{\theta}) = \mathcal{N}(y \mid w_0 + w_1 x, \sigma^2)$. The density falls off
exponentially fast as we move away from the regression line. Figure generated by linregWedgeDemo2. 


of the data, we have $e_{10}(0.1) = 0.8$, so we need to extend the cube 80% along each dimension
around x. Even if we only use 1% of the data, we find $e_{10}(0.01) = 0.63$: see Figure 1.16. Since
the entire range of the data is only 1 along each dimension, we see that the method is no longer 
very local, despite the name “nearest neighbor”. The trouble with looking at neighbors that are 
so far away is that they may not be good predictors about the behavior of the input-output 
function at a given point. 
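
The numbers quoted above follow directly from the formula for the expected edge length, and can be checked with a short Python sketch. The function name edge_length is a hypothetical helper, not part of the curseDimensionality script.

```python
def edge_length(f, D):
    """Expected edge length e_D(f) = f**(1/D) of a hyper-cube that contains a
    fraction f of data drawn uniformly from the D-dimensional unit cube."""
    return f ** (1.0 / D)

# Edge length needed to capture 10% and 1% of the data as D grows.
for D in (1, 3, 10, 100):
    print(D, round(edge_length(0.1, D), 3), round(edge_length(0.01, D), 3))
```

For D = 10 this gives roughly 0.79 and 0.63, matching the values in the text; for D = 100 even 1% of the data requires a cube covering most of the range along every dimension.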


Parametric models for classification and regression 


The main way to combat the curse of dimensionality is to make some assumptions about 
the nature of the data distribution (either p(y|x) for a supervised problem or p(x) for an 
unsupervised problem). These assumptions, known as inductive bias, are often embodied in 
the form of a parametric model, which is a statistical model with a fixed number of parameters. 
Below we briefly describe two widely used examples; we will revisit these and other models in 
much greater depth later in the book. 


Linear regression 


One of the most widely used models for regression is known as linear regression. This asserts 
that the response is a linear function of the inputs. This can be written as follows: 


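$$y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + \epsilon = \sum_{j=1}^{D} w_j x_j + \epsilon \qquad (1.4)$$

where $\mathbf{w}^T \mathbf{x}$ represents the inner or scalar product between the input vector $\mathbf{x}$ and the model's
weight vector $\mathbf{w}$,⁷ and $\epsilon$ is the residual error between our linear predictions and the true
response.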


7. In statistics, it is more common to denote the regression weights by $\boldsymbol{\beta}$.
Figure 1.18 Polynomial of degrees 14 and 20 fit by least squares to 21 data points. Figure generated by
linregPolyVsDegree. 


We often assume that $\epsilon$ has a Gaussian⁸ or normal distribution. We denote this by $\epsilon \sim
\mathcal{N}(\mu, \sigma^2)$, where $\mu$ is the mean and $\sigma^2$ is the variance (see Chapter 2 for details). When we plot
this distribution, we get the well-known bell curve shown in Figure 1.17(a). 

To make the connection between linear regression and Gaussians more explicit, we can rewrite 
the model in the following form: 


$$p(y \mid \mathbf{x}, \boldsymbol{\theta}) = \mathcal{N}(y \mid \mu(\mathbf{x}), \sigma^2(\mathbf{x})) \qquad (1.5)$$


This makes it clear that the model is a conditional probability density. In the simplest case, we 
assume $\mu$ is a linear function of $\mathbf{x}$, so $\mu = \mathbf{w}^T \mathbf{x}$, and that the noise is fixed, $\sigma^2(\mathbf{x}) = \sigma^2$. In
this case, $\boldsymbol{\theta} = (\mathbf{w}, \sigma^2)$ are the parameters of the model.

For example, suppose the input is 1 dimensional. We can represent the expected response as 
follows: 


$$\mu(x) = w_0 + w_1 x = \mathbf{w}^T \mathbf{x} \qquad (1.6)$$


where $w_0$ is the intercept or bias term, $w_1$ is the slope, and where we have defined the vector
$\mathbf{x} = (1, x)$. (Prepending a constant 1 term to an input vector is a common notational trick which
allows us to combine the intercept term with the other terms in the model.) If $w_1$ is positive,
it means we expect the output to increase as the input increases. This is illustrated in 1d in
Figure 1.17(b); a more conventional plot, of the mean response vs x, is shown in Figure 1.7(a). 

Linear regression can be made to model non-linear relationships by replacing x with some 
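non-linear function of the inputs, $\boldsymbol{\phi}(\mathbf{x})$. That is, we use


$$p(y \mid \mathbf{x}, \boldsymbol{\theta}) = \mathcal{N}(y \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}), \sigma^2) \qquad (1.7)$$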


This is known as basis function expansion. For example, Figure 1.18 illustrates the case where
$\boldsymbol{\phi}(x) = [1, x, x^2, \ldots, x^d]$, for $d = 14$ and $d = 20$; this is known as polynomial regression.
We will consider other kinds of basis functions later in the book. In fact, many popular 
machine learning methods — such as support vector machines, neural networks, classification 
and regression trees, etc. — can be seen as just different ways of estimating basis functions 
from data, as we discuss in Chapters 14 and 16. 
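
To make basis function expansion concrete, here is a minimal NumPy sketch of polynomial regression fit by least squares. It is not the linregPolyVsDegree code: the data-generating function, the noise level, and the helper names poly_features and fit_poly are assumptions chosen for illustration.

```python
import numpy as np

def poly_features(x, d):
    """Map scalar inputs x (shape N) to phi(x) = [1, x, x^2, ..., x^d]."""
    return np.vander(x, d + 1, increasing=True)

def fit_poly(x, y, d):
    """Least squares estimate of w in y ~ w^T phi(x); returns (w, training MSE)."""
    Phi = poly_features(x, d)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w, float(np.mean((Phi @ w - y) ** 2))

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 21)                   # 21 inputs, as in Figure 1.18
# Made-up quadratic ground truth plus Gaussian noise.
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.normal(size=x.shape)

for d in (1, 2, 14):
    w, train_mse = fit_poly(x, y, d)
    print(f"degree {d:2d}: training MSE = {train_mse:.4f}")
```

The model stays linear in the parameters w, so the same least squares solver is used for every degree; only the feature map changes. The training error can only go down as the degree grows, which does not by itself mean the higher-degree fit predicts better.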


8. Carl Friedrich Gauss (1777-1855) was a German mathematician and physicist. 
Figure 1.19 (a) The sigmoid or logistic function. We have sigm(−∞) = 0, sigm(0) = 0.5, and
sigm(∞) = 1. Figure generated by sigmoidPlot. (b) Logistic regression for SAT scores. Solid black dots
are the data. The open red circles are the predicted probabilities. The green crosses denote two students 
with the same SAT score of 525 (and hence same input representation x) but with different training labels 
(one student passed, y = 1, the other failed, y = 0). Hence this data is not perfectly separable using just 
the SAT feature. Figure generated by logregSATdemo. 


Logistic regression 


We can generalize linear regression to the (binary) classification setting by making two changes. 
First we replace the Gaussian distribution for y with a Bernoulli distribution,⁹ which is more
appropriate for the case when the response is binary, $y \in \{0, 1\}$. That is, we use


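$$p(y \mid \mathbf{x}, \mathbf{w}) = \mathrm{Ber}(y \mid \mu(\mathbf{x})) \qquad (1.8)$$


where $\mu(\mathbf{x}) = \mathbb{E}[y \mid \mathbf{x}] = p(y = 1 \mid \mathbf{x})$. Second, we compute a linear combination of the inputs,
as before, but then we pass this through a function that ensures $0 \le \mu(\mathbf{x}) \le 1$ by defining


$$\mu(\mathbf{x}) = \mathrm{sigm}(\mathbf{w}^T \mathbf{x}) \qquad (1.9)$$

where $\mathrm{sigm}(\eta)$ refers to the sigmoid function, also known as the logistic function (its inverse
is the logit function). This is defined as

$$\mathrm{sigm}(\eta) \triangleq \frac{1}{1 + \exp(-\eta)} = \frac{e^{\eta}}{e^{\eta} + 1} \qquad (1.10)$$
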
The term “sigmoid” means S-shaped: see Figure 1.19(a) for a plot. It is also known as a squashing 
function, since it maps the whole real line to [0,1], which is necessary for the output to be 
interpreted as a probability. 


Putting these two steps together we get 
$$p(y \mid \mathbf{x}, \mathbf{w}) = \mathrm{Ber}(y \mid \mathrm{sigm}(\mathbf{w}^T \mathbf{x})) \qquad (1.11)$$


This is called logistic regression due to its similarity to linear regression (although it is a form 
of classification, not regression!). 
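
The two steps above (a linear combination followed by the sigmoid) can be sketched in plain Python as follows. The branch on the sign of eta is an implementation choice to avoid overflow in exp, not something prescribed by the text, and the names sigm and predict_proba, as well as the example weights, are assumptions for illustration.

```python
import math

def sigm(eta):
    """Sigmoid (logistic) function 1 / (1 + exp(-eta)), computed stably."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)           # avoids overflow for very negative eta
    return z / (1.0 + z)

def predict_proba(x, w):
    """p(y = 1 | x, w) = sigm(w^T x) for equal-length sequences x and w."""
    return sigm(sum(wj * xj for wj, xj in zip(w, x)))

print(sigm(0.0))                                # 0.5
# Here x = (1, 2) includes the prepended constant 1, and w^T x = 0.
print(predict_proba([1.0, 2.0], [-1.0, 0.5]))   # 0.5
```

Both expressions in Equation 1.10 are used: the first form for non-negative inputs and the second for negative ones, so that exp is only ever called on a non-positive argument.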


9. Daniel Bernoulli (1700-1782) was a Dutch-Swiss mathematician and physicist. 
A simple example of logistic regression is shown in Figure 1.19(b), where we plot

$$p(y_i = 1 \mid x_i, \mathbf{w}) = \mathrm{sigm}(w_0 + w_1 x_i) \qquad (1.12)$$


where $x_i$ is the SAT¹⁰ score of student $i$ and $y_i$ is whether they passed or failed a class. The
solid black dots show the training data, and the red circles plot $p(y = 1 \mid x_i, \hat{\mathbf{w}})$, where $\hat{\mathbf{w}}$ are
the parameters estimated from the training data (we discuss how to compute these estimates in 
Section 8.3.4). 

If we threshold the output probability at 0.5, we can induce a decision rule of the form 


$$\hat{y}(x) = 1 \iff p(y = 1 \mid \mathbf{x}) > 0.5 \qquad (1.13)$$


By looking at Figure 1.19(b), we see that $\mathrm{sigm}(w_0 + w_1 x) = 0.5$ for $x \approx 545 = x^*$. We can
imagine drawing a vertical line at x = x*; this is known as a decision boundary. Everything to 
the left of this line is classified as a 0, and everything to the right of the line is classified as a 1. 
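
Since sigm(η) = 0.5 exactly when η = 0, the boundary in the 1d case satisfies w_0 + w_1 x* = 0, i.e., x* = −w_0 / w_1. The sketch below illustrates this; the weights are hypothetical values chosen only so that the boundary lands at 545, and are not the estimates fitted to the SAT data.

```python
import math

def decision_boundary_1d(w0, w1):
    """Input x* at which sigm(w0 + w1 * x) = 0.5, i.e. w0 + w1 * x* = 0."""
    if w1 == 0:
        raise ValueError("w1 must be non-zero for a boundary to exist")
    return -w0 / w1

def classify(x, w0, w1):
    """Predict 1 when p(y = 1 | x) > 0.5, else 0 (Equation 1.13)."""
    p = 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))
    return 1 if p > 0.5 else 0

# Hypothetical weights, chosen only so that the boundary lands at 545.
w0, w1 = -10.9, 0.02
x_star = decision_boundary_1d(w0, w1)
print(x_star, classify(500, w0, w1), classify(600, w0, w1))
```

Note that inputs to the right of x* are classified as 1 only because w_1 is positive here; with a negative slope the two sides would swap.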

We notice that this decision rule has a non-zero error rate even on the training set. This 
is because the data is not linearly separable, i.e., there is no straight line we can draw to 
separate the 0s from the 1s. We can create models with non-linear decision boundaries using
basis function expansion, just as we did with non-linear regression. We will see many examples 
of this later in the book. 


Overfitting 


When we fit highly flexible models, we need to be careful that we do not overfit the data, that 
is, we should avoid trying to model every minor variation in the input, since this is more likely 
to be noise than true signal. This is illustrated in Figure 1.18(b), where we see that using a high 
degree polynomial results in a curve that is very “wiggly”. It is unlikely that the true function 
has such extreme oscillations. Thus using such a model might result in inaccurate predictions of
future outputs.

As another example, consider the KNN classifier. The value of K can have a large effect on 
the behavior of this model. When K = 1, the method makes no errors on the training set (since 
we just return the labels of the original training points), but the resulting prediction surface is 
very “wiggly” (see Figure 1.20(a)). Therefore the method may not work well at predicting future 
data. In Figure 1.20(b), we see that using K = 5 results in a smoother prediction surface, 
because we are averaging over a larger neighborhood. As K increases, the predictions become
smoother until, in the limit of K = N, we end up predicting the majority label of the whole 
data set. Below we discuss how to pick the “right” value of K. 


Model selection 


When we have a variety of models of different complexity (e.g., linear or logistic regression 
models with different degree polynomials, or KNN classifiers with different values of K), how 
should we pick the right one? A natural approach is to compute the misclassification rate on 


10. SAT stands for “Scholastic Aptitude Test”. This is a standardized test for college admissions used in the United States 
(the data in this example is from Johnson and Albert 1999, p87). 
Figure 1.20 Prediction surface for KNN on the data in Figure 1.15(a). (a) K=1. (b) K=5. Figure generated by
knnClassifyDemo. 


the training set for each method. This is defined as follows: 


$$\mathrm{err}(f, \mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(f(\mathbf{x}_i) \neq y_i) \qquad (1.14)$$


where f(x) is our classifier. In Figure 1.21(a), we plot this error rate vs K for a KNN classifier 
(dotted blue line). We see that increasing K increases our error rate on the training set, because 
we are over-smoothing. As we said above, we can get minimal error on the training set by using 
K = 1, since this model is just memorizing the data. 
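
The following Python sketch computes the training-set misclassification rate of Equation 1.14 for a KNN classifier at several values of K. The tiny 1d data set (with one deliberately "noisy" label) and the helper names knn_classify, misclassification_rate and training_error are made up for illustration; this is not the code behind Figure 1.21.

```python
import math
from collections import Counter

def knn_classify(x, X, y, K):
    """Majority label among the K training points nearest to x."""
    nearest = sorted(range(len(X)), key=lambda i: math.dist(x, X[i]))[:K]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

def misclassification_rate(f, X, y):
    """err(f, D) = (1/N) * sum_i I(f(x_i) != y_i), as in Equation 1.14."""
    return sum(f(xi) != yi for xi, yi in zip(X, y)) / len(X)

# Made-up 1d training set; the labels at x = 0.4 and x = 0.6 look "noisy".
X = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,), (0.6,), (0.8,), (0.9,), (1.0,)]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1]

def training_error(K):
    """Error of KNN evaluated on the same points it was trained on."""
    return misclassification_rate(lambda x: knn_classify(x, X, y, K), X, y)

for K in (1, 3, len(X)):
    print(f"K={K}: training error = {training_error(K):.3f}")
```

With K = 1 each training point is its own nearest neighbor, so the training error is zero; with K = N every point receives the majority label, so the error equals the fraction of minority-class points.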

However, what we care about is generalization error, which is the expected value of the 
misclassification rate when averaged over future data (see Section 6.3 for details). This can be 
approximated by computing the misclassification rate on a large independent test set, not used 
during model training. We plot the test error vs K in Figure 1.21(a) in solid red (upper curve). 
Now we see a U-shaped curve: for complex models (small K), the method overfits, and for 
simple models (big K), the method underfits. Therefore, an obvious way to pick K is to pick 
the value with the minimum error on the test set (in this example, any value between 10 and 
100 should be fine). 

Unfortunately, when training the model, we don’t have access to the test set (by assumption), 
so we cannot use the test set to pick the model of the right complexity.¹¹ However, we can create
a test set by partitioning the training set into two: the part used for training the model, and a 
second part, called the validation set, used for selecting the model complexity. We then fit all 
the models on the training set, and evaluate their performance on the validation set, and pick 
the best. Once we have picked the best, we can refit it to all the available data. If we have a 
separate test set, we can evaluate performance on this, in order to estimate the accuracy of our 
method. (We discuss this in more detail in Section 6.5.3.) 

Often we use about 80% of the data for the training set, and 20% for the validation set. But 
if the number of training cases is small, this technique runs into problems, because the model 


11. In academic settings, we usually do have access to the test set, but we should not use it for model fitting or model
selection, otherwise we will get an unrealistically optimistic estimate of performance of our method. This is one of the 
“golden rules” of machine learning research. 
Figure 1.21 (a) Misclassification rate vs K in a K-nearest neighbor classifier. On the left, where K is
small, the model is complex and hence we overfit. On the right, where K is large, the model is simple 
and we underfit. Dotted blue line: training set (size 200). Solid red line: test set (size 500). (b) Schematic 
of 5-fold cross validation. Figure generated by knnClassifyDemo. 


won't have enough data to train on, and we won't have enough data to make a reliable estimate 
of the future performance. 

A simple but popular solution to this is to use cross validation (CV). The idea is simple: we 
split the training data into K folds; then, for each fold $k \in \{1, \ldots, K\}$, we train on all the
folds but the k'th, and test on the k'th, in a round-robin fashion, as sketched in Figure 1.21(b).
We then compute the error averaged over all the folds, and use this as a proxy for the test error.
(Note that each point gets predicted only once, although it will be used for training K − 1 times.)
It is common to use K = 5; this is called 5-fold CV. If we set K = N, then we get a method
called leave-one-out cross validation, or LOOCV, since in fold i, we train on all the data cases
except for i, and then test on i. Exercise 1.3 asks you to compute the 5-fold CV estimate of the
test error vs K, and to compare it to the empirical test error in Figure 1.21(a). 
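
The round-robin procedure can be sketched in plain Python as below. This is an illustrative sketch rather than a solution to Exercise 1.3: the synthetic two-cluster data, the random seeds, and the helper names kfold_indices and cv_error are assumptions. To avoid a clash of symbols, the number of folds is called n_folds, and K is reserved for the number of neighbors.

```python
import math
import random
from collections import Counter

def knn_classify(x, X, y, K):
    """Majority label among the K training points nearest to x."""
    nearest = sorted(range(len(X)), key=lambda i: math.dist(x, X[i]))[:K]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

def kfold_indices(N, n_folds, seed=0):
    """Shuffle 0..N-1 and split the indices into n_folds nearly equal folds."""
    idx = list(range(N))
    random.Random(seed).shuffle(idx)
    return [idx[k::n_folds] for k in range(n_folds)]

def cv_error(X, y, K, n_folds=5, seed=0):
    """Cross-validated misclassification rate of a KNN classifier."""
    errors = 0
    for held_out in kfold_indices(len(X), n_folds, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        # Each point is predicted exactly once, by a model that never saw it.
        errors += sum(knn_classify(X[i], X_tr, y_tr, K) != y[i] for i in held_out)
    return errors / len(X)

# Made-up 2d data: two overlapping Gaussian clusters, 50 points each.
rng = random.Random(1)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)] + \
    [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(50)]
y = [0] * 50 + [1] * 50

for K in (1, 5, 15):
    print(f"K={K}: 5-fold CV error = {cv_error(X, y, K):.3f}")
```

Pooling the errors over all held-out points, as done here, equals the average of the per-fold error rates when the folds have equal size. Calling cv_error with n_folds=len(X) gives leave-one-out cross validation.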

Choosing K for a KNN classifier is a special case of a more general problem known as model 
selection, where we have to choose between models with different degrees of flexibility. Cross- 
validation is widely used for solving such problems, although we will discuss other approaches 
later in the book. 


No free lunch theorem 


All models are wrong, but some models are useful. — George Box (Box and Draper 1987, 
p424). 


Much of machine learning is concerned with devising different models, and different algorithms 
to fit them. We can use methods such as cross validation to empirically choose the best method 
for our particular problem. However, there is no universally best model — this is sometimes 
called the no free lunch theorem (Wolpert 1996). The reason for this is that a set of assumptions 
that works well in one domain may work poorly in another. 


12. George Box is a retired statistics professor at the University of Wisconsin. 
As a consequence of the no free lunch theorem, we need to develop many different types of
models, to cover the wide variety of data that occurs in the real world. And for each model, 
there may be many different algorithms we can use to train the model, which make different 
speed-accuracy-complexity tradeoffs. It is this combination of data, models and algorithms that 
we will be studying in the subsequent chapters. 


Exercises 


Exercise 1.1 KNN classifier on shuffled MNIST data 


Run mnist1NNdemo and verify that the misclassification rate (on the first 1000 test cases) of MNIST of a
1-NN classifier is 3.8%. (If you run it on all 10,000 test cases, the error rate is 3.09%.) Modify the code
so that you first randomly permute the features (columns of the training and test design matrices), as in 
shuffledDigitsDemo, and then apply the classifier. Verify that the error rate is not changed. 


Exercise 1.2 Approximate KNN classifiers 


Use the Matlab/C++ code at http://people.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN to perform
approximate nearest neighbor search, and combine it with mnist1NNdemo to classify the MNIST data
set. How much speedup do you get, and what is the drop (if any) in accuracy? 


Exercise 1.3 CV for KNN 


Use knnClassifyDemo to plot the CV estimate of the misclassification rate on the test set. Compare this 
to Figure 1.21(a). Discuss the similarities and differences to the test error rate. 




Probability 


Introduction 


Probability theory is nothing but common sense reduced to calculation. — Pierre Laplace, 
1812 


In the previous chapter, we saw how probability can play a useful role in machine learning. In 
this chapter, we discuss probability theory in more detail. We do not have the space to go into
great detail — for that, you are better off consulting some of the excellent textbooks available 
on this topic, such as (Jaynes 2003; Bertsekas and Tsitsiklis 2008; Wasserman 2004). But we will
briefly review many of the key ideas you will need in later chapters. 

Before we start with the more technical material, let us pause and ask: what is probability? 
We are all familiar with the phrase “the probability that a coin will land heads is 0.5”. But what 
does this mean? There are actually at least two different interpretations of probability. One is 
called the frequentist interpretation. In this view, probabilities represent long run frequencies 
of events. For example, the above statement means that, if we flip the coin many times, we 
expect it to land heads about half the time.¹

The other interpretation is called the Bayesian interpretation of probability. In this view, 
probability is used to quantify our uncertainty about something; hence it is fundamentally 
related to information rather than repeated trials (Jaynes 2003). In the Bayesian view, the above
statement means we believe the coin is equally likely to land heads or tails on the next toss. 

One big advantage of the Bayesian interpretation is that it can be used to model our uncertainty
about events that do not have long term frequencies. For example, we might want to
compute the probability that the polar ice cap will melt by 2020 CE. This event will happen zero 
or one times, but cannot happen repeatedly. Nevertheless, we ought to be able to quantify our 
uncertainty about this event; based on how probable we think this event is, we will (hopefully!) 
take appropriate actions (see Section 5.7 for a discussion of optimal decision making under 
uncertainty). To give some more machine learning oriented examples, we might have received 
a specific email message, and want to compute the probability it is spam. Or we might have 
observed a “blip” on our radar screen, and want to compute the probability distribution over 
the location of the corresponding target (be it a bird, plane, or missile). In all these cases, the 
idea of repeated trials does not make sense, but the Bayesian interpretation is valid and indeed
quite natural. We shall therefore adopt the Bayesian interpretation in this book. Fortunately, the
basic rules of probability theory are the same, no matter which interpretation is adopted.


1. Actually, the Stanford statistician (and former professional magician) Persi Diaconis has shown that a coin is about
51% likely to land facing the same way up as it started, due to the physics of the problem (Diaconis et al. 2007).


Figure 2.1 (a) A uniform distribution on {1, 2, 3, 4}, with p(x = k) = 1/4. (b) A degenerate distribution
p(x) = 1 if x = 1 and p(x) = 0 if x ∈ {2, 3, 4}. Figure generated by discreteProbDistFig.


A brief review of probability theory 


This section is a very brief review of the basics of probability theory, and is merely meant as 
a refresher for readers who may be “rusty”. Readers who are already familiar with these basics 
may safely skip this section. 


Discrete random variables 


The expression p(A) denotes the probability that the event A is true. For example, A might
be the logical expression “it will rain tomorrow”. We require that $0 \le p(A) \le 1$, where
p(A) = 0 means the event definitely will not happen, and p(A) = 1 means the event definitely
will happen. We write $p(\overline{A})$ to denote the probability of the event not A; this is defined to
be $p(\overline{A}) = 1 - p(A)$. We will often write A = 1 to mean the event A is true, and A = 0 to mean
the event A is false.

We can extend the notion of binary events by defining a discrete random variable X, which
can take on any value from a finite or countably infinite set $\mathcal{X}$. We denote the probability of
the event that X = x by p(X = x), or just p(x) for short. Here p() is called a probability
mass function or pmf. This satisfies the properties $0 \le p(x) \le 1$ and $\sum_{x \in \mathcal{X}} p(x) = 1$.
Figure 2.1 shows two pmf’s defined on the finite state space $\mathcal{X} = \{1, 2, 3, 4\}$. On the left we
have a uniform distribution, p(x) = 1/4, and on the right, we have a degenerate distribution,
$p(x) = \mathbb{I}(x = 1)$, where $\mathbb{I}()$ is the binary indicator function. This distribution represents the
fact that X is always equal to the value 1, in other words, it is a constant.
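The two pmfs of Figure 2.1 can be written down directly. The following is a minimal sketch, assuming the four-state space {1, 2, 3, 4} from the figure caption; the helper names are illustrative and are not taken from discreteProbDistFig.

```python
from fractions import Fraction

def uniform_pmf(states):
    """Uniform pmf: every state gets probability 1/|states|."""
    n = len(states)
    return {x: Fraction(1, n) for x in states}

def degenerate_pmf(states, value):
    """Degenerate pmf p(x) = I(x = value)."""
    return {x: Fraction(int(x == value)) for x in states}

def is_valid_pmf(pmf):
    """Check 0 <= p(x) <= 1 for all x and that the probabilities sum to 1."""
    return all(0 <= p <= 1 for p in pmf.values()) and sum(pmf.values()) == 1

states = [1, 2, 3, 4]
p_uniform = uniform_pmf(states)
p_degenerate = degenerate_pmf(states, 1)
```

Exact fractions are used here only so that the sum-to-one check is free of rounding error.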


Fundamental rules 


In this section, we review the basic rules of probability.


Probability of a union of two events
Given two events, A and B, we define the probability of A or B as follows:


$$\begin{aligned}
p(A \lor B) &= p(A) + p(B) - p(A \land B) && (2.1) \\
&= p(A) + p(B) \quad \text{if $A$ and $B$ are mutually exclusive} && (2.2)
\end{aligned}$$


Joint probabilities 
We define the probability of the joint event A and B as follows: 
$$p(A, B) = p(A \land B) = p(A|B)\,p(B) \tag{2.3}$$


This is sometimes called the product rule. Given a joint distribution on two events p(A, B), 
we define the marginal distribution as follows: 


$$p(A) = \sum_b p(A, B) = \sum_b p(A|B = b)\,p(B = b) \tag{2.4}$$


where we are summing over all possible states of B. We can define p(B) similarly. This is 
sometimes called the sum rule or the rule of total probability. 
The product rule can be applied multiple times to yield the chain rule of probability: 


$$p(X_{1:D}) = p(X_1)\,p(X_2|X_1)\,p(X_3|X_2, X_1)\,p(X_4|X_1, X_2, X_3) \cdots p(X_D|X_{1:D-1}) \tag{2.5}$$


where we introduce the Matlab-like notation 1 : D to denote the set {1,2,..., D}. 
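The union, product and sum rules can be checked numerically on a small table. This is a sketch under the assumption of a made-up joint distribution over two binary events; the numbers and function names are hypothetical.

```python
# Hypothetical joint distribution p(A, B) over two binary events,
# stored as a table indexed by (a, b). The numbers are made up.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def marginal_a(a):
    """Sum rule: p(A=a) = sum_b p(A=a, B=b)."""
    return sum(p for (ai, _), p in joint.items() if ai == a)

def marginal_b(b):
    """Sum rule: p(B=b) = sum_a p(A=a, B=b)."""
    return sum(p for (_, bi), p in joint.items() if bi == b)

def conditional_a_given_b(a, b):
    """p(A=a | B=b) = p(A=a, B=b) / p(B=b), defined when p(B=b) > 0."""
    return joint[(a, b)] / marginal_b(b)

# Product rule: p(A, B) = p(A|B) p(B)
product = conditional_a_given_b(1, 1) * marginal_b(1)

# Union: p(A or B) = p(A) + p(B) - p(A and B)
p_union = marginal_a(1) + marginal_b(1) - joint[(1, 1)]

# Total probability: p(A=1) = sum_b p(A=1|B=b) p(B=b)
total = sum(conditional_a_given_b(1, b) * marginal_b(b) for b in (0, 1))
```

The union can also be read off as one minus the probability that neither event occurs, which gives an independent check on the first rule.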


Conditional probability 
We define the conditional probability of event A, given that event B is true, as follows: 


$$p(A|B) = \frac{p(A, B)}{p(B)} \quad \text{if } p(B) > 0 \tag{2.6}$$


Bayes rule 


Combining the definition of conditional probability with the product and sum rules yields Bayes
rule, also called Bayes Theorem²:

$$p(X = x|Y = y) = \frac{p(X = x, Y = y)}{p(Y = y)} = \frac{p(X = x)\,p(Y = y|X = x)}{\sum_{x'} p(X = x')\,p(Y = y|X = x')} \tag{2.7}$$


2. Thomas Bayes (1702-1761) was an English mathematician and Presbyterian minister.


Example: medical diagnosis


As an example of how to use this rule, consider the following medical diagnosis problem.
Suppose you are a woman in your 40s, and you decide to have a medical test for breast cancer
called a mammogram. If the test is positive, what is the probability you have cancer? That
obviously depends on how reliable the test is. Suppose you are told the test has a sensitivity
of 80%, which means, if you have cancer, the test will be positive with probability 0.8. In other
words,


$$p(x = 1|y = 1) = 0.8 \tag{2.8}$$


where x = 1 is the event the mammogram is positive, and y = 1 is the event you have breast 
cancer. Many people conclude they are therefore 80% likely to have cancer. But this is false! It 
ignores the prior probability of having breast cancer, which fortunately is quite low: 


p(y = 1) = 0.004 (2.9) 


Ignoring this prior is called the base rate fallacy. We also need to take into account the fact 
that the test may be a false positive or false alarm. Unfortunately, such false positives are 
quite likely (with current screening technology): 


$$p(x = 1|y = 0) = 0.1 \tag{2.10}$$

Combining these three terms using Bayes rule, we can compute the correct answer as follows: 

$$\begin{aligned}
p(y = 1|x = 1) &= \frac{p(x = 1|y = 1)\,p(y = 1)}{p(x = 1|y = 1)\,p(y = 1) + p(x = 1|y = 0)\,p(y = 0)} && (2.11) \\
&= \frac{0.8 \times 0.004}{0.8 \times 0.004 + 0.1 \times 0.996} = 0.031 && (2.12)
\end{aligned}$$


where p(y = 0) = 1 − p(y = 1) = 0.996. In other words, if you test positive, you only have
about a 3% chance of actually having breast cancer!³
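The arithmetic of Equations 2.11 and 2.12 can be reproduced in a few lines. This is a small sketch; the function name and argument names are illustrative choices, and the three input numbers are those quoted in the example.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """p(y=1 | x=1) via Bayes rule for a binary test."""
    # Denominator: total probability of a positive test, p(x=1)
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

p_cancer_given_positive = posterior(prior=0.004, sensitivity=0.8,
                                    false_positive_rate=0.1)
print(round(p_cancer_given_positive, 3))  # 0.031
```

Rerunning the function with a larger prior shows how strongly the answer depends on the base rate, which is the point of the base rate fallacy.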


Example: Generative classifiers 


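We can generalize the medical diagnosis example to classify feature vectors x of arbitrary type
as follows:


$$p(y = c|\mathbf{x}, \boldsymbol{\theta}) = \frac{p(y = c|\boldsymbol{\theta})\,p(\mathbf{x}|y = c, \boldsymbol{\theta})}{\sum_{c'} p(y = c'|\boldsymbol{\theta})\,p(\mathbf{x}|y = c', \boldsymbol{\theta})} \tag{2.13}$$
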
This is called a generative classifier, since it specifies how to generate the data using the class- 
conditional density p(x|y = c) and the class prior p(y = c). We discuss such models in detail 
in Chapters 3 and 4. An alternative approach is to directly fit the class posterior, p(y = c|x); 
this is known as a discriminative classifier. We discuss the pros and cons of the two approaches 
in Section 8.6. 
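Equation 2.13 amounts to multiplying each class prior by its class-conditional likelihood and normalizing. The sketch below assumes a hypothetical two-class problem with one binary feature; the class names and probabilities are invented for illustration.

```python
def class_posterior(x, priors, likelihoods):
    """Return p(y=c | x) for every class c using Equation 2.13.

    priors:      dict mapping class c -> p(y=c)
    likelihoods: dict mapping class c -> function x -> p(x | y=c)
    """
    unnormalized = {c: priors[c] * likelihoods[c](x) for c in priors}
    evidence = sum(unnormalized.values())   # denominator of Equation 2.13
    return {c: u / evidence for c, u in unnormalized.items()}

# Hypothetical two-class problem with a single binary feature.
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {
    "spam": lambda x: 0.8 if x == 1 else 0.2,
    "ham": lambda x: 0.1 if x == 1 else 0.9,
}
post = class_posterior(1, priors, likelihoods)
```

A discriminative classifier would instead model the returned dictionary directly as a function of x, without the two ingredients on the right-hand side.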


Independence and conditional independence 


We say X and Y are unconditionally independent or marginally independent, denoted
$X \perp Y$, if we can represent the joint as the product of the two marginals (see Figure 2.2), i.e.,


$$X \perp Y \iff p(X, Y) = p(X)\,p(Y) \tag{2.14}$$


Figure 2.2 Computing p(x, y) = p(x)p(y), where $X \perp Y$. Here X and Y are discrete random variables;
X has 6 possible states (values) and Y has 5 possible states. A general joint distribution on two such
variables would require (6 × 5) − 1 = 29 parameters to define it (we subtract 1 because of the sum-to-one
constraint). By assuming (unconditional) independence, we only need (6 − 1) + (5 − 1) = 9 parameters
to define p(x, y).


In general, we say a set of variables is mutually independent if the joint can be written as a 
product of marginals. 
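The factorization in Equation 2.14 and the parameter counts quoted for Figure 2.2 can be illustrated as follows. This is a sketch with made-up marginals over 6 and 5 states; the helper names are assumptions.

```python
from itertools import product

def outer(px, py):
    """Joint p(x, y) = p(x) p(y) for independent X and Y."""
    return {(i, j): px[i] * py[j]
            for i, j in product(range(len(px)), range(len(py)))}

def is_independent(joint, nx, ny, tol=1e-12):
    """Check whether a joint table equals the product of its own marginals."""
    px = [sum(joint[(i, j)] for j in range(ny)) for i in range(nx)]
    py = [sum(joint[(i, j)] for i in range(nx)) for j in range(ny)]
    return all(abs(joint[(i, j)] - px[i] * py[j]) <= tol
               for i in range(nx) for j in range(ny))

px = [0.1, 0.2, 0.3, 0.2, 0.1, 0.1]   # 6 states (made-up numbers)
py = [0.2, 0.2, 0.2, 0.2, 0.2]        # 5 states
joint = outer(px, py)

params_general = len(px) * len(py) - 1              # 29
params_independent = (len(px) - 1) + (len(py) - 1)  # 9
```

A table that puts all its mass on the diagonal, for instance, fails the check, because knowing one variable then determines the other.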

Unfortunately, unconditional independence is rare, because most variables can influence most 
other variables. However, usually this influence is mediated via other variables rather than being 
direct. We therefore say X and Y are conditionally independent (CI) given Z iff the conditional 
joint can be written as a product of conditional marginals: 


$$X \perp Y \mid Z \iff p(X, Y|Z) = p(X|Z)\,p(Y|Z) \tag{2.15}$$


When we discuss graphical models in Chapter 10, we will see that we can write this assumption
as a graph X - Z - Y, which captures the intuition that all the dependencies between X and Y
are mediated via Z. For example, the probability it will rain tomorrow (event X) is independent 
of whether the ground is wet today (event Y), given knowledge of whether it is raining today 
(event Z). Intuitively, this is because Z “causes” both X and Y, so if we know Z, we do not 
need to know about Y in order to predict X or vice versa. We shall expand on this concept in 
Chapter 10. 
Another characterization of CI is this: 


Theorem 2.2.1. $X \perp Y \mid Z$ iff there exist functions g and h such that

$$p(x, y|z) = g(x, z)\,h(y, z) \tag{2.16}$$

for all x, y, z such that p(z) > 0.

See Exercise 2.8 for the proof.


3. These numbers are from (McGrayne 2011, p257). Based on this analysis, the US government decided not to recommend
annual mammogram screening to women in their 40s: the number of false alarms would cause needless worry and
stress amongst women, and result in unnecessary, expensive, and potentially harmful followup tests. See Section 5.7
for the optimal way to trade off risk versus reward in the face of uncertainty.

CI assumptions allow us to build large probabilistic models from small pieces. We will see 
many examples of this throughout the book. In particular, in Section 3.5, we discuss naive Bayes 
classifiers, in Section 17.2, we discuss Markov models, and in Chapter 10 we discuss graphical 
models; all of these models heavily exploit CI properties. 


Continuous random variables 


So far, we have only considered reasoning about uncertain discrete quantities. We will now show
(following (Jaynes 2003, p107)) how to extend probability to reason about uncertain continuous
quantities.

Suppose X is some uncertain continuous quantity. The probability that X lies in any interval
$a \le X \le b$ can be computed as follows. Define the events $A = (X \le a)$, $B = (X \le b)$ and
$W = (a < X \le b)$. We have that $B = A \lor W$, and since A and W are mutually exclusive, the
sum rule gives


$$p(B) = p(A) + p(W) \tag{2.17}$$

and hence

$$p(W) = p(B) - p(A) \tag{2.18}$$


Define the function $F(q) \triangleq p(X \le q)$. This is called the cumulative distribution function
or cdf of X. This is obviously a monotonically increasing function. See Figure 2.3(a) for an
example. Using this notation we have


$$p(a < X \le b) = F(b) - F(a) \tag{2.19}$$

Now define $f(x) = \frac{d}{dx} F(x)$ (we assume this derivative exists); this is called the probability
density function or pdf. See Figure 2.3(b) for an example. Given a pdf, we can compute the
probability of a continuous variable being in a finite interval as follows:


$$P(a < X \le b) = \int_a^b f(x)\,dx \tag{2.20}$$


Figure 2.3 (a) Plot of the cdf for the standard normal, $\mathcal{N}(0, 1)$. (b) Corresponding pdf. The shaded
regions each contain $\alpha/2$ of the probability mass. Therefore the nonshaded region contains $1 - \alpha$ of the
probability mass. If the distribution is Gaussian $\mathcal{N}(0, 1)$, then the leftmost cutoff point is $\Phi^{-1}(\alpha/2)$, where
$\Phi$ is the cdf of the Gaussian. By symmetry, the rightmost cutoff point is $\Phi^{-1}(1 - \alpha/2) = -\Phi^{-1}(\alpha/2)$. If
$\alpha = 0.05$, the central interval is 95%, and the left cutoff is -1.96 and the right is 1.96. Figure generated by
quantileDemo.


As the size of the interval gets smaller, we can write

$$P(x \le X \le x + dx) \approx p(x)\,dx \tag{2.21}$$


We require $p(x) \ge 0$, but it is possible for p(x) > 1 for any given x, so long as the density
integrates to 1. As an example, consider the uniform distribution Unif(a, b):


$$\text{Unif}(x|a, b) = \frac{1}{b - a}\,\mathbb{I}(a \le x \le b) \tag{2.22}$$


If we set a = 0 and b = 1/2, we have p(x) = 2 for any $x \in [0, \frac{1}{2}]$.
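The uniform example, in which a density exceeds 1 yet still integrates to 1, can be checked numerically. The sketch below uses a simple midpoint rule; the integrator and its step count are illustrative choices and not part of the text.

```python
def unif_pdf(x, a, b):
    """Density of Unif(a, b): 1/(b-a) inside [a, b], 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def integrate(f, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of f on [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

a, b = 0.0, 0.5
density = unif_pdf(0.25, a, b)                       # 2.0, larger than 1
mass = integrate(lambda x: unif_pdf(x, a, b), -1.0, 1.0)
# Probability of an interval as an integral of the pdf (Equation 2.20)
p_interval = integrate(lambda x: unif_pdf(x, a, b), 0.1, 0.3)
```

The interval probability agrees with F(0.3) − F(0.1) = 0.6 − 0.2, since the cdf of this distribution is F(x) = 2x on [0, 1/2].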


Quantiles 


Since the cdf F is a monotonically increasing function, it has an inverse; let us denote this by
$F^{-1}$. If F is the cdf of X, then $F^{-1}(\alpha)$ is the value of $x_\alpha$ such that $P(X \le x_\alpha) = \alpha$; this is
called the $\alpha$ quantile of F. The value $F^{-1}(0.5)$ is the median of the distribution, with half of
the probability mass on the left, and half on the right. The values $F^{-1}(0.25)$ and $F^{-1}(0.75)$
are the lower and upper quartiles.

We can also use the inverse cdf to compute tail area probabilities. For example, if $\Phi$ is
the cdf of the Gaussian distribution $\mathcal{N}(0, 1)$, then points to the left of $\Phi^{-1}(\alpha/2)$ contain $\alpha/2$
probability mass, as illustrated in Figure 2.3(b). By symmetry, points to the right of $\Phi^{-1}(1 - \alpha/2)$
also contain $\alpha/2$ of the mass. Hence the central interval $(\Phi^{-1}(\alpha/2), \Phi^{-1}(1 - \alpha/2))$ contains
$1 - \alpha$ of the mass. If we set $\alpha = 0.05$, the central 95% interval is covered by the range


$$(\Phi^{-1}(0.025), \Phi^{-1}(0.975)) = (-1.96, 1.96) \tag{2.23}$$


If the distribution is $\mathcal{N}(\mu, \sigma^2)$, then the 95% interval becomes $(\mu - 1.96\sigma, \mu + 1.96\sigma)$. This is
sometimes approximated by writing $\mu \pm 2\sigma$.
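The quantiles above can be computed with an inverse cdf routine. The following sketch uses Python's standard-library NormalDist class rather than the book's quantileDemo; the values of mu and sigma in the last lines are arbitrary.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

alpha = 0.05
left = std_normal.inv_cdf(alpha / 2)        # about -1.96
right = std_normal.inv_cdf(1 - alpha / 2)   # about  1.96
central_mass = std_normal.cdf(right) - std_normal.cdf(left)

median = std_normal.inv_cdf(0.5)
lower_q, upper_q = std_normal.inv_cdf(0.25), std_normal.inv_cdf(0.75)

# For N(mu, sigma^2) the 95% interval is mu +/- 1.96 sigma.
mu, sigma = 10.0, 2.0
dist = NormalDist(mu, sigma)
interval = (dist.inv_cdf(0.025), dist.inv_cdf(0.975))
```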


Mean and variance 


The most familiar property of a distribution is its mean, or expected value, denoted by $\mu$. For
discrete rv’s, it is defined as $\mathbb{E}[X] \triangleq \sum_{x \in \mathcal{X}} x\,p(x)$, and for continuous rv’s, it is defined as
$\mathbb{E}[X] \triangleq \int_{\mathcal{X}} x\,p(x)\,dx$. If this integral is not finite, the mean is not defined (we will see some
examples of this later).

The variance is a measure of the “spread” of a distribution, denoted by $\sigma^2$. This is defined
as follows:

$$\begin{aligned}
\text{var}[X] &\triangleq \mathbb{E}\left[(X - \mu)^2\right] = \int (x - \mu)^2 p(x)\,dx && (2.24) \\
&= \int x^2 p(x)\,dx + \mu^2 \int p(x)\,dx - 2\mu \int x\,p(x)\,dx = \mathbb{E}[X^2] - \mu^2 && (2.25)
\end{aligned}$$

from which we derive the useful result

$$\mathbb{E}[X^2] = \mu^2 + \sigma^2 \tag{2.26}$$

The standard deviation is defined as

$$\text{std}[X] \triangleq \sqrt{\text{var}[X]} \tag{2.27}$$


This is useful since it has the same units as X itself. 
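For a discrete distribution the integrals in Equations 2.24 to 2.26 become sums, which makes the identity easy to verify. This is a minimal sketch using a uniform pmf on {1, 2, 3, 4} as an assumed example.

```python
def mean(pmf):
    """E[X] = sum_x x p(x) for a pmf given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """var[X] = E[(X - mu)^2]."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def second_moment(pmf):
    """E[X^2] = sum_x x^2 p(x)."""
    return sum(x ** 2 * p for x, p in pmf.items())

pmf = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}   # uniform on {1, 2, 3, 4}
mu, var = mean(pmf), variance(pmf)
std = var ** 0.5   # same units as X
```

Here mu = 2.5 and var = 1.25, so E[X²] = 6.25 + 1.25 = 7.5, in agreement with Equation 2.26.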


Some common discrete distributions 


In this section, we review some commonly used parametric distributions defined on discrete 
state spaces, both finite and countably infinite. 


The binomial and Bernoulli distributions 


Suppose we toss a coin n times. Let $X \in \{0, \ldots, n\}$ be the number of heads. If the probability
of heads is $\theta$, then we say X has a binomial distribution, written as $X \sim \text{Bin}(n, \theta)$. The pmf
is given by


$$\text{Bin}(k|n, \theta) \triangleq \binom{n}{k} \theta^k (1 - \theta)^{n-k} \tag{2.28}$$

where

$$\binom{n}{k} \triangleq \frac{n!}{(n - k)!\,k!} \tag{2.29}$$


is the number of ways to choose k items from n (this is known as the binomial coefficient,
and is pronounced “n choose k”). See Figure 2.4 for some examples of the binomial distribution.
This distribution has the following mean and variance:


$$\text{mean} = n\theta, \quad \text{var} = n\theta(1 - \theta) \tag{2.30}$$


Figure 2.4 Illustration of the binomial distribution with n = 10 and $\theta \in \{0.25, 0.9\}$. Figure generated
by binomDistPlot.


Now suppose we toss a coin only once. Let $X \in \{0, 1\}$ be a binary random variable, with
probability of “success” or “heads” of $\theta$. We say that X has a Bernoulli distribution. This is
written as $X \sim \text{Ber}(\theta)$, where the pmf is defined as


$$\text{Ber}(x|\theta) = \theta^{\mathbb{I}(x = 1)} (1 - \theta)^{\mathbb{I}(x = 0)} \tag{2.31}$$

In other words,

$$\text{Ber}(x|\theta) = \begin{cases} \theta & \text{if } x = 1 \\ 1 - \theta & \text{if } x = 0 \end{cases} \tag{2.32}$$


This is obviously just a special case of a Binomial distribution with n = 1.
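The binomial and Bernoulli pmfs in Equations 2.28 to 2.32 are straightforward to evaluate. The sketch below is not binomDistPlot; it uses n = 10 and θ = 0.25 as in Figure 2.4 and checks the mean and variance by direct summation.

```python
from math import comb

def binom_pmf(k, n, theta):
    """Bin(k | n, theta) = C(n, k) theta^k (1 - theta)^(n - k)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

def bernoulli_pmf(x, theta):
    """Ber(x | theta): theta if x == 1, 1 - theta if x == 0."""
    return theta if x == 1 else 1 - theta

n, theta = 10, 0.25
pmf = [binom_pmf(k, n, theta) for k in range(n + 1)]
mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
```

The summed mean comes out as n·θ = 2.5 and the variance as n·θ·(1 − θ) = 1.875.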


The multinomial and multinoulli distributions 


The binomial distribution can be used to model the outcomes of coin tosses. To model the
outcomes of tossing a K-sided die, we can use the multinomial distribution. This is defined as
follows: let $\mathbf{x} = (x_1, \ldots, x_K)$ be a random vector, where $x_j$ is the number of times side j of
the die occurs. Then x has the following pmf:

$$\text{Mu}(\mathbf{x}|n, \boldsymbol{\theta}) \triangleq \binom{n}{x_1 \ldots x_K} \prod_{j=1}^K \theta_j^{x_j} \tag{2.33}$$

where $\theta_j$ is the probability that side j shows up, and

$$\binom{n}{x_1 \ldots x_K} \triangleq \frac{n!}{x_1!\,x_2! \cdots x_K!} \tag{2.34}$$


is the multinomial coefficient (the number of ways to divide a set of size $n = \sum_{k=1}^K x_k$ into
subsets with sizes $x_1$ up to $x_K$).

Now suppose n = 1. This is like rolling a K-sided die once, so x will be a vector of 0s
and 1s (a bit vector), in which only one bit can be turned on. Specifically, if the die shows
up as face k, then the k’th bit will be on. In this case, we can think of x as being a scalar
categorical random variable with K states (values), and x is its dummy encoding, that is,
$\mathbf{x} = [\mathbb{I}(x = 1), \ldots, \mathbb{I}(x = K)]$. For example, if K = 3, we encode the states 1, 2 and 3 as
(1, 0, 0), (0, 1, 0), and (0, 0, 1). This is also called a one-hot encoding, since we imagine that
only one of the K “wires” is “hot” or on. In this case, the pmf becomes


$$\text{Mu}(\mathbf{x}|1, \boldsymbol{\theta}) = \prod_{j=1}^K \theta_j^{\mathbb{I}(x_j = 1)} \tag{2.35}$$


Name          n    K    x
Multinomial   -    -    $\mathbf{x} \in \{0, 1, \ldots, n\}^K$, $\sum_{k=1}^K x_k = n$
Multinoulli   1    -    $\mathbf{x} \in \{0, 1\}^K$, $\sum_{k=1}^K x_k = 1$ (1-of-K encoding)
Binomial      -    1    $x \in \{0, 1, \ldots, n\}$
Bernoulli     1    1    $x \in \{0, 1\}$

Table 2.1 Summary of the multinomial and related distributions.


Figure 2.5 (a) Some aligned DNA sequences. (b) The corresponding sequence logo. Figure generated by
seqlogoDemo.


See Figure 2.1(b-c) for an example. This very common special case is known as a categorical
or discrete distribution. (Gustavo Lacerda suggested we call it the multinoulli distribution, by
analogy with the Binomial/Bernoulli distinction, a term which we shall adopt in this book.) We
will use the following notation for this case:

$$\text{Cat}(x|\boldsymbol{\theta}) \triangleq \text{Mu}(\mathbf{x}|1, \boldsymbol{\theta}) \tag{2.36}$$

In other words, if $x \sim \text{Cat}(\boldsymbol{\theta})$, then $p(x = j|\boldsymbol{\theta}) = \theta_j$. See Table 2.1 for a summary.
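The multinomial pmf and its n = 1 special case can be sketched as follows. The probability vector theta and the count vector are made-up values, and the function names are illustrative.

```python
from math import factorial, prod

def multinomial_coef(counts):
    """n! / (x_1! x_2! ... x_K!) with n = sum of counts."""
    n = sum(counts)
    denom = prod(factorial(x) for x in counts)
    return factorial(n) // denom

def multinomial_pmf(counts, theta):
    """Mu(x | n, theta) from Equation 2.33."""
    return multinomial_coef(counts) * prod(t**x for t, x in zip(theta, counts))

def one_hot(k, K):
    """Dummy (one-hot) encoding of state k in {1, ..., K}."""
    return [int(j == k) for j in range(1, K + 1)]

theta = [0.2, 0.3, 0.5]
p_counts = multinomial_pmf([1, 2, 1], theta)       # n = 4 rolls of a 3-sided die
p_face_2 = multinomial_pmf(one_hot(2, 3), theta)   # n = 1, equals theta[1]
```

With a one-hot count vector the coefficient is 1 and the product picks out a single entry of theta, which is exactly the statement p(x = j|θ) = θ_j.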


Application: DNA sequence motifs 


An interesting application of multinomial models arises in biosequence analysis. Suppose
we have a set of (aligned) DNA sequences, such as in Figure 2.5(a), where there are 10 rows
(sequences) and 15 columns (locations along the genome). We see that several locations are
conserved by evolution (e.g., because they are part of a gene coding region), since the corresponding
columns tend to be “pure”. For example, column 7 is all G’s.

One way to visually summarize the data is by using a sequence logo: see Figure 2.5(b). We
plot the letters A, C, G and T with a font size proportional to their empirical probability, and with
the most probable letter on the top. The empirical probability distribution at location t, $\hat{\boldsymbol{\theta}}_t$, is
obtained by normalizing the vector of counts (see Equation 3.48):


$$\mathbf{N}_t = \left( \sum_{i=1}^N \mathbb{I}(X_{it} = 1), \sum_{i=1}^N \mathbb{I}(X_{it} = 2), \sum_{i=1}^N \mathbb{I}(X_{it} = 3), \sum_{i=1}^N \mathbb{I}(X_{it} = 4) \right) \tag{2.37}$$

$$\hat{\boldsymbol{\theta}}_t = \mathbf{N}_t / N \tag{2.38}$$


This distribution is known as a motif. We can also compute the most probable letter in each
location; this is called the consensus sequence.


Figure 2.6 Illustration of some Poisson distributions for $\lambda \in \{1, 10\}$. We have truncated the x-axis to
25 for clarity, but the support of the distribution is over all the non-negative integers. Figure generated by
poissonPlotDemo.
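Returning to the DNA motif application: the count vectors, the per-location empirical distributions and the consensus sequence can be computed in a few lines. This is a sketch on a made-up toy alignment, not the data of Figure 2.5 and not the seqlogoDemo code.

```python
from collections import Counter

ALPHABET = "ACGT"

def motif(sequences):
    """Per-location empirical distributions for equal-length aligned sequences."""
    n = len(sequences)
    result = []
    for column in zip(*sequences):          # iterate over locations t
        counts = Counter(column)            # the count vector N_t
        result.append({letter: counts[letter] / n for letter in ALPHABET})
    return result

def consensus(theta_hat):
    """Most probable letter at each location."""
    return "".join(max(dist, key=dist.get) for dist in theta_hat)

# Made-up toy alignment: 4 sequences, 4 locations.
seqs = ["ACGT", "ACGA", "TCGT", "ACGG"]
theta_hat = motif(seqs)
```

In this toy alignment the second and third columns are "pure", so their distributions put all the mass on a single letter.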


The Poisson distribution 


We say that $X \in \{0, 1, 2, \ldots\}$ has a Poisson distribution with parameter $\lambda > 0$, written
$X \sim \text{Poi}(\lambda)$, if its pmf is


$$\text{Poi}(x|\lambda) = e^{-\lambda} \frac{\lambda^x}{x!} \tag{2.39}$$


The first term is just the normalization constant, required to ensure the distribution sums to 1. 
The Poisson distribution is often used as a model for counts of rare events like radioactive 
decay and traffic accidents. See Figure 2.6 for some plots. 
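The role of the first term as a normalization constant can be checked by summing the pmf. The sketch below truncates the infinite support at 100, which is an assumption that is harmless for λ = 10; it also uses the standard fact, not stated above, that the mean of a Poisson distribution equals λ.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Poi(x | lam) = exp(-lam) * lam^x / x!  for x = 0, 1, 2, ..."""
    return exp(-lam) * lam**x / factorial(x)

lam = 10.0
pmf = [poisson_pmf(x, lam) for x in range(100)]   # truncated support
total = sum(pmf)
mean = sum(x * p for x, p in enumerate(pmf))
```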


The empirical distribution 


Given a set of data, $\mathcal{D} = \{x_1, \ldots, x_N\}$, we define the empirical distribution, also called the
empirical measure, as follows:


$$p_{\text{emp}}(A) \triangleq \frac{1}{N} \sum_{i=1}^N \delta_{x_i}(A) \tag{2.40}$$

where $\delta_x(A)$ is the Dirac measure, defined by

$$\delta_x(A) = \begin{cases} 0 & \text{if } x \notin A \\ 1 & \text{if } x \in A \end{cases} \tag{2.41}$$


In general, we can associate “weights” with each sample:

$$p(x) = \sum_{i=1}^N w_i \delta_{x_i}(x) \tag{2.42}$$


where we require $0 \le w_i \le 1$ and $\sum_{i=1}^N w_i = 1$. We can think of this as a histogram, with
“spikes” at the data points $x_i$, where $w_i$ determines the height of spike i. This distribution
assigns 0 probability to any point not in the data set.
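The empirical measure can be represented as a function that takes a set A, here encoded as a predicate, and returns the total weight of the data points that fall in it. This is a sketch; the data values and the predicate encoding are assumptions made for illustration.

```python
def empirical_measure(data, weights=None):
    """Return a function A -> p_emp(A), where A is a predicate on points."""
    n = len(data)
    if weights is None:
        weights = [1.0 / n] * n          # Equation 2.40: equal weights 1/N
    assert abs(sum(weights) - 1.0) < 1e-9 and all(0 <= w <= 1 for w in weights)

    def p_emp(in_A):
        # delta_{x_i}(A) is 1 when x_i lies in A, else 0
        return sum(w for x, w in zip(data, weights) if in_A(x))

    return p_emp

data = [0.5, 1.5, 1.5, 3.0]
p_emp = empirical_measure(data)
p_below_2 = p_emp(lambda x: x < 2)        # 3 of 4 points
p_unseen = p_emp(lambda x: x == 2.0)      # no data point equals 2.0
```

Passing an explicit weights list gives the weighted version of Equation 2.42.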
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Some common continuous distributions 


In this section we present some commonly used univariate (one-dimensional) continuous prob- 
ability distributions. 


Gaussian (normal) distribution 


The most widely used distribution in statistics and machine learning is the Gaussian or normal
distribution. Its pdf is given by

$$\mathcal{N}(x|\mu, \sigma^2) \triangleq \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x - \mu)^2} \tag{2.43}$$

Here $\mu = \mathbb{E}[X]$ is the mean (and mode), and $\sigma^2 = \text{var}[X]$ is the variance. The term $\sqrt{2\pi\sigma^2}$ is the
normalization constant needed to ensure the density integrates to 1 (see Exercise 2.11).

We write $X \sim \mathcal{N}(\mu, \sigma^2)$ to denote that $p(X = x) = \mathcal{N}(x|\mu, \sigma^2)$. If $X \sim \mathcal{N}(0, 1)$, we
say X follows a standard normal distribution. See Figure 2.3(b) for a plot of this pdf; this is
sometimes called the bell curve.

We will often talk about the precision of a Gaussian, by which we mean the inverse variance:
$\lambda = 1/\sigma^2$. A high precision means a narrow distribution (low variance) centered on $\mu$.⁴

Note that, since this is a pdf, we can have p(x) > 1. To see this, consider evaluating the
density at its center, $x = \mu$. We have $\mathcal{N}(\mu|\mu, \sigma^2) = (\sigma\sqrt{2\pi})^{-1} e^0$, so if $\sigma < 1/\sqrt{2\pi}$, we have
p(x) > 1.

The cumulative distribution function or cdf of the Gaussian is defined as


$$\Phi(x; \mu, \sigma^2) \triangleq \int_{-\infty}^{x} \mathcal{N}(z|\mu, \sigma^2)\,dz \tag{2.44}$$


See Figure 2.3(a) for a plot of this cdf when $\mu = 0$, $\sigma^2 = 1$. This integral has no closed form
expression, but is built in to most software packages. In particular, we can compute it in terms
of the error function (erf):


$$\Phi(x; \mu, \sigma) = \frac{1}{2}\left[1 + \text{erf}(z/\sqrt{2})\right] \tag{2.45}$$


where $z = (x - \mu)/\sigma$ and


$$\text{erf}(x) \triangleq \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt \tag{2.46}$$
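Equations 2.43 and 2.45 translate directly into code, since the error function is available in most standard libraries. The sketch below uses Python's math.erf; the function names and the two test variances are illustrative choices.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2), Equation 2.43."""
    return exp(-((x - mu) ** 2) / (2 * sigma2)) / sqrt(2 * pi * sigma2)

def normal_cdf(x, mu, sigma2):
    """Phi(x; mu, sigma^2) via the error function, Equation 2.45."""
    z = (x - mu) / sqrt(sigma2)
    return 0.5 * (1 + erf(z / sqrt(2)))

peak_wide = normal_pdf(0.0, 0.0, 1.0)      # about 0.399
peak_narrow = normal_pdf(0.0, 0.0, 0.01)   # sigma = 0.1 < 1/sqrt(2*pi), so > 1
```

With σ = 0.1 the density at the center is about 3.99, which illustrates that a pdf value is not a probability.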


The Gaussian distribution is the most widely used distribution in statistics. There are several 
reasons for this. First, it has two parameters which are easy to interpret, and which capture 
some of the most basic properties of a distribution, namely its mean and variance. Second, 
the central limit theorem (Section 2.6.3) tells us that sums of independent random variables 
have an approximately Gaussian distribution, making it a good choice for modeling residual 
errors or “noise”. Third, the Gaussian distribution makes the least number of assumptions (has
maximum entropy), subject to the constraint of having a specified mean and variance, as we
show in Section 9.2.6; this makes it a good default choice in many cases. Finally, it has a simple
mathematical form, which results in easy to implement, but often highly effective, methods, as
we will see. See (Jaynes 2003, ch 7) for a more extensive discussion of why Gaussians are so
widely used.


4. The symbol λ will have many different meanings in this book, in order to be consistent with the rest of the literature.
The intended meaning should be clear from context.


Degenerate pdf 


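In the limit that $\sigma^2 \to 0$, the Gaussian becomes an infinitely tall and infinitely thin “spike”
centered at $\mu$:


$$\lim_{\sigma^2 \to 0} \mathcal{N}(x|\mu, \sigma^2) = \delta(x - \mu) \tag{2.47}$$


where $\delta$ is called a Dirac delta function, and is defined as


$$\delta(x) = \begin{cases} \infty & \text{if } x = 0 \\ 0 & \text{if } x \ne 0 \end{cases} \tag{2.48}$$

such that

$$\int_{-\infty}^{\infty} \delta(x)\,dx = 1 \tag{2.49}$$


A useful property of delta functions is the sifting property, which selects out a single term
from a sum or integral:


$$\int_{-\infty}^{\infty} f(x)\,\delta(x - \mu)\,dx = f(\mu) \tag{2.50}$$


since the integrand is only non-zero if $x - \mu = 0$.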

One problem with the Gaussian distribution is that it is sensitive to outliers, since the log-
probability only decays quadratically with distance from the center. A more robust distribution
is the Student t distribution.⁵ Its pdf is as follows:


$$\mathcal{T}(x|\mu, \sigma^2, \nu) \propto \left[1 + \frac{1}{\nu}\left(\frac{x - \mu}{\sigma}\right)^2\right]^{-\left(\frac{\nu + 1}{2}\right)} \tag{2.51}$$


where $\mu$ is the mean, $\sigma^2 > 0$ is the scale parameter, and $\nu > 0$ is called the degrees of
freedom. See Figure 2.7 for some plots. For later reference, we note that the distribution has
the following properties:


$$\text{mean} = \mu, \quad \text{mode} = \mu, \quad \text{var} = \frac{\nu\sigma^2}{\nu - 2} \tag{2.52}$$


5. This distribution has a colourful etymology. It was first published in 1908 by William Sealy Gosset, who worked at the
Guinness brewery in Dublin. Since his employer would not allow him to use his own name, he called it the “Student”
distribution. The origin of the term t seems to have arisen in the context of Tables of the Student distribution, used by
Fisher when developing the basis of classical statistical inference. See http://jeff560.tripod.com/s.html for more
historical details.


Figure 2.7 (a) The pdfs for a $\mathcal{N}(0, 1)$, $\mathcal{T}(0, 1, 1)$ and $\text{Lap}(0, 1/\sqrt{2})$. The mean is 0 and the variance
is 1 for both the Gaussian and Laplace. The mean and variance of the Student are undefined when $\nu = 1$.
(b) Log of these pdfs. Note that the Student distribution is not log-concave for any parameter value, unlike
the Laplace distribution, which is always log-concave (and log-convex...) Nevertheless, both are unimodal.
Figure generated by studentLaplacePdfPlot.


Figure 2.8 Illustration of the effect of outliers on fitting Gaussian, Student and Laplace distributions. (a)
No outliers (the Gaussian and Student curves are on top of each other). (b) With outliers. We see that the
Gaussian is more affected by outliers than the Student and Laplace distributions. Based on Figure 2.16 of
(Bishop 2006a). Figure generated by robustDemo.


The variance is only defined if $\nu > 2$. The mean is only defined if $\nu > 1$.

As an illustration of the robustness of the Student distribution, consider Figure 2.8. On the
left, we show a Gaussian and a Student fit to some data with no outliers. On the right, we
add some outliers. We see that the Gaussian is affected a lot, whereas the Student distribution
hardly changes. This is because the Student has heavier tails, at least for small $\nu$ (see Figure 2.7).

If $\nu = 1$, this distribution is known as the Cauchy or Lorentz distribution. This is notable
for having such heavy tails that the integral that defines the mean does not converge.

To ensure finite variance, we require $\nu > 2$. It is common to use $\nu = 4$, which gives good
performance in a range of problems (Lange et al. 1989). For $\nu \gg 5$, the Student distribution
rapidly approaches a Gaussian distribution and loses its robustness properties.


Figure 2.9 (a) Some Ga(a, b = 1) distributions. If $a \le 1$, the mode is at 0, otherwise it is > 0. As
we increase the rate b, we reduce the horizontal scale, thus squeezing everything leftwards and upwards.
Figure generated by gammaPlotDemo. (b) An empirical pdf of some rainfall data, with a fitted Gamma
distribution superimposed. Figure generated by gammaRainfallDemo.
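Returning to the Student t distribution: its heavier tails can be seen by comparing log densities at a point far from the center. The sketch below is not robustDemo. Equation 2.51 gives the density only up to proportionality, so the code assumes the standard normalization constant Γ((ν+1)/2) / (Γ(ν/2) √(νπ) σ), which is not stated in the text.

```python
from math import lgamma, log, pi

def student_logpdf(x, mu, sigma2, nu):
    """Log of the normalized Student t density (location mu, scale sigma^2, dof nu)."""
    z2 = (x - mu) ** 2 / sigma2
    # Assumed standard normalizer; the text specifies the pdf only up to a constant.
    log_norm = (lgamma((nu + 1) / 2) - lgamma(nu / 2)
                - 0.5 * log(nu * pi * sigma2))
    return log_norm - (nu + 1) / 2 * log(1 + z2 / nu)

def normal_logpdf(x, mu, sigma2):
    """Log of the Gaussian density; decays quadratically in x."""
    return -0.5 * log(2 * pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)

def student_variance(sigma2, nu):
    """Equation 2.52; only defined for nu > 2."""
    if nu <= 2:
        raise ValueError("variance undefined for nu <= 2")
    return nu * sigma2 / (nu - 2)

x_outlier = 6.0
lp_student = student_logpdf(x_outlier, 0.0, 1.0, nu=4)
lp_normal = normal_logpdf(x_outlier, 0.0, 1.0)
```

At x = 6 the Student log density with ν = 4 is roughly -6.7 against roughly -18.9 for the standard Gaussian, so an outlier there is penalized far less under the Student model.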


The Laplace distribution 


Another distribution with heavy tails is the Laplace distribution⁶, also known as the double
sided exponential distribution. This has the following pdf:


$$\text{Lap}(x|\mu, b) \triangleq \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right) \tag{2.53}$$


Here $\mu$ is a location parameter and b > 0 is a scale parameter. See Figure 2.7 for a plot. This
distribution has the following properties:


$$\text{mean} = \mu, \quad \text{mode} = \mu, \quad \text{var} = 2b^2 \tag{2.54}$$


Its robustness to outliers is illustrated in Figure 2.8. It also puts more probability density at 0
than the Gaussian. This property is a useful way to encourage sparsity in a model, as we will
see in Section 13.3.
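Both claims about the Laplace distribution, more density at 0 and heavier tails than a Gaussian of the same variance, can be checked directly. The sketch below uses b = 1/√2 so that the variance 2b² equals 1, matching the setting of Figure 2.7; the evaluation points are arbitrary.

```python
from math import exp, pi, sqrt

def laplace_pdf(x, mu, b):
    """Lap(x | mu, b) = exp(-|x - mu| / b) / (2b)."""
    return exp(-abs(x - mu) / b) / (2 * b)

def normal_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2), used here only for comparison."""
    return exp(-((x - mu) ** 2) / (2 * sigma2)) / sqrt(2 * pi * sigma2)

b = 1 / sqrt(2)                     # gives variance 2 b^2 = 1
var_laplace = 2 * b**2
at_zero_laplace = laplace_pdf(0.0, 0.0, b)    # about 0.707
at_zero_normal = normal_pdf(0.0, 0.0, 1.0)    # about 0.399
tail_laplace = laplace_pdf(5.0, 0.0, b)
tail_normal = normal_pdf(5.0, 0.0, 1.0)
```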


The gamma distribution 


The gamma distribution is a flexible distribution for positive real valued rv’s, x > 0. It is
defined in terms of two parameters, called the shape a > 0 and the rate b > 0:⁷

$$\text{Ga}(T|\text{shape} = a, \text{rate} = b) \triangleq \frac{b^a}{\Gamma(a)} T^{a-1} e^{-Tb} \tag{2.55}$$

where $\Gamma(a)$ is the gamma function:

$$\Gamma(x) \triangleq \int_0^\infty u^{x-1} e^{-u}\,du \tag{2.56}$$


See Figure 2.9 for some plots. For later reference, we note that the distribution has the following
properties:

$$\text{mean} = \frac{a}{b}, \quad \text{mode} = \frac{a - 1}{b}, \quad \text{var} = \frac{a}{b^2} \tag{2.57}$$


6. Pierre-Simon Laplace (1749-1827) was a French mathematician, who played a key role in creating the field of Bayesian
statistics.

7. There is an alternative parameterization, where we use the scale parameter instead of the rate: $\text{Ga}_s(T|a, b) \triangleq$
$\text{Ga}(T|a, 1/b)$. This version is the one used by Matlab’s gampdf, although in this book we will use the rate parameterization
unless otherwise specified.


There are several distributions which are just special cases of the Gamma, which we discuss
below.


• Exponential distribution This is defined by $\text{Expon}(x|\lambda) \triangleq \text{Ga}(x|1, \lambda)$, where $\lambda$ is the rate
parameter. This distribution describes the times between events in a Poisson process, i.e. a
process in which events occur continuously and independently at a constant average rate $\lambda$.

• Erlang distribution This is the same as the Gamma distribution where a is an integer. It
is common to fix a = 2, yielding the one-parameter Erlang distribution, $\text{Erlang}(x|\lambda) =$
$\text{Ga}(x|2, \lambda)$, where $\lambda$ is the rate parameter.

• Chi-squared distribution This is defined by $\chi^2(x|\nu) \triangleq \text{Ga}(x|\frac{\nu}{2}, \frac{1}{2})$. This is the distribution
of the sum of squared Gaussian random variables. More precisely, if $Z_i \sim \mathcal{N}(0, 1)$, and
$S = \sum_{i=1}^{\nu} Z_i^2$, then $S \sim \chi^2_\nu$.


Another useful result is the following: If $X \sim \text{Ga}(a, b)$, then one can show (Exercise 2.10)
that $\frac{1}{X} \sim \text{IG}(a, b)$, where IG is the inverse gamma distribution defined by


$$\text{IG}(x|\text{shape} = a, \text{scale} = b) \triangleq \frac{b^a}{\Gamma(a)} x^{-(a+1)} e^{-b/x} \tag{2.58}$$

The distribution has these properties


$$\text{mean} = \frac{b}{a - 1}, \quad \text{mode} = \frac{b}{a + 1}, \quad \text{var} = \frac{b^2}{(a - 1)^2 (a - 2)} \tag{2.59}$$


The mean only exists if a > 1. The variance only exists if a > 2.
We will see applications of these distributions later on.
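The gamma pdf in the rate parameterization, together with its exponential and chi-squared special cases, can be sketched as below. The numerical integrator is a simple midpoint rule added only to check normalization and the mean; it and the chosen parameter values are assumptions, and this is not gammaPlotDemo.

```python
from math import exp, lgamma, log

def gamma_pdf(x, a, b):
    """Ga(x | shape=a, rate=b) for x > 0, computed in log space for stability."""
    if x <= 0:
        return 0.0
    return exp(a * log(b) - lgamma(a) + (a - 1) * log(x) - b * x)

def expon_pdf(x, lam):
    """Exponential distribution as the special case Ga(x | 1, lam)."""
    return gamma_pdf(x, 1.0, lam)

def chi2_pdf(x, nu):
    """Chi-squared distribution as Ga(x | nu/2, 1/2)."""
    return gamma_pdf(x, nu / 2, 0.5)

def gamma_moments(a, b):
    """Mean, mode and variance from Equation 2.57 (mode formula assumes a >= 1)."""
    return a / b, (a - 1) / b, a / b**2

def integrate(f, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of f on [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

a, b = 2.0, 1.0
mass = integrate(lambda x: gamma_pdf(x, a, b), 0.0, 40.0)
mean_numeric = integrate(lambda x: x * gamma_pdf(x, a, b), 0.0, 40.0)
```

To evaluate the scale parameterization mentioned in the footnote, one would pass 1/b as the rate.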


The beta distribution 


The beta distribution has support over the interval [0, 1] and is defined as follows:


$$\text{Beta}(x|a, b) = \frac{1}{B(a, b)} x^{a-1} (1 - x)^{b-1} \tag{2.60}$$

Here B(a, b) is the beta function,

$$B(a, b) \triangleq \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)} \tag{2.61}$$


See Figure 2.10 for plots of some beta distributions. We require a, b > 0 to ensure the distribution
is integrable (i.e., to ensure B(a, b) exists). If a = b = 1, we get the uniform distribution. If
a and b are both less than 1, we get a bimodal distribution with “spikes” at 0 and 1; if a and
b are both greater than 1, the distribution is unimodal. For later reference, we note that the
distribution has the following properties (Exercise 2.16):

$$\text{mean} = \frac{a}{a + b}, \quad \text{mode} = \frac{a - 1}{a + b - 2}, \quad \text{var} = \frac{ab}{(a + b)^2 (a + b + 1)} \tag{2.62}$$


Figure 2.10 Some beta distributions, shown for (a, b) = (0.1, 0.1), (1.0, 1.0), (2.0, 3.0) and (8.0, 4.0).
Figure generated by betaPlotDemo.
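The beta pdf and the properties in Equation 2.62 can be sketched as follows, using the log-gamma function to evaluate B(a, b). The parameter values (a, b) = (2, 3) are taken from one of the curves in Figure 2.10; the helper names are illustrative and this is not betaPlotDemo.

```python
from math import exp, lgamma, log

def log_beta_fn(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_pdf(x, a, b):
    """Beta(x | a, b) on the open interval (0, 1)."""
    if not 0 < x < 1:
        return 0.0
    return exp((a - 1) * log(x) + (b - 1) * log(1 - x) - log_beta_fn(a, b))

def beta_moments(a, b):
    """Mean, mode and variance from Equation 2.62 (mode formula assumes a, b > 1)."""
    mean = a / (a + b)
    mode = (a - 1) / (a + b - 2)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, mode, var

mean, mode, var = beta_moments(2.0, 3.0)
```

For a = b = 1 the density evaluates to 1 everywhere in (0, 1), which is the uniform special case mentioned above.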


Pareto distribution 


The Pareto distribution is used to model the distribution of quantities that exhibit long tails, 
also called heavy tails. For example, it has been observed that the most frequent word in 
English (“the”) occurs approximately twice as often as the second most frequent word (“of”), 
which occurs twice as often as the fourth most frequent word, etc. If we plot the frequency of 
words vs their rank, we will get a power law; this is known as Zipf’s law. Wealth has a similarly 
skewed distribution, especially in plutocracies such as the USA.⁸ 

The Pareto pdf is defined as follows: 

$$\mathrm{Pareto}(x \mid k, m) = k m^k x^{-(k+1)} \mathbb{I}(x \ge m) \tag{2.63}$$

This density asserts that x must be greater than some constant m, but not too much greater, 
where k controls what is “too much”. As $k \to \infty$, the distribution approaches $\delta(x - m)$. See 
Figure 2.11(a) for some plots. If we plot the distribution on a log-log scale, it forms a straight 
line, of the form log p(x) = a log x + c for some constants a and c (here $a = -(k+1)$ and $c = \log(k m^k)$). 
See Figure 2.11(b) for an 
illustration (this is known as a power law). This distribution has the following properties 


$$\text{mean} = \frac{km}{k-1} \text{ if } k > 1, \quad \text{mode} = m, \quad \text{var} = \frac{m^2 k}{(k-1)^2(k-2)} \text{ if } k > 2 \tag{2.64}$$
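
The straight-line behaviour on a log-log scale can be confirmed directly from the pdf. The sketch below uses hypothetical helper names and the illustrative values k = 2, m = 1; it estimates the slope of log p(x) against log x between two points.

```python
import math

def pareto_pdf(x, k, m):
    """Pareto(x | k, m) = k m^k x^-(k+1) for x >= m, and 0 otherwise."""
    return k * m**k * x ** (-(k + 1)) if x >= m else 0.0

def loglog_slope(x1, x2, k, m):
    """Slope of log p(x) against log x between two points x1, x2 >= m."""
    dy = math.log(pareto_pdf(x2, k, m)) - math.log(pareto_pdf(x1, k, m))
    return dy / (math.log(x2) - math.log(x1))

k, m = 2.0, 1.0  # illustrative values
print(loglog_slope(1.0, 10.0, k, m))     # close to -(k + 1)
print(loglog_slope(10.0, 1000.0, k, m))  # same slope, so the log-log plot is a line
```

The slope does not depend on which two points are chosen, which is what makes the power law appear as a straight line.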
8. In the USA, 400 Americans have more wealth than half of all Americans combined. (Source: 
http://www.politifact.com/wisconsin/statements/2011/mar/10/michael-moore/michael-moore-says-400-americans-have-more-wealth-.) 
See (Hacker and Pierson 2010) for a political analysis of how such an 
extreme distribution of income has arisen in a democratic country. 

Figure 2.11 (a) The Pareto distribution Pareto(x|m, k) for m = 1. (b) The pdf on a log-log scale. Figure 
generated by paretoPlot. 


Joint probability distributions 


So far, we have been mostly focusing on modeling univariate probability distributions. In this 
section, we start our discussion of the more challenging problem of building joint probability 
distributions on multiple related random variables; this will be a central topic in this book. 

A joint probability distribution has the form $p(x_1, \ldots, x_D)$ for a set of D > 1 variables, 
and models the (stochastic) relationships between the variables. If all the variables are discrete, 
we can represent the joint distribution as a big multi-dimensional array, with one variable per 
dimension. However, the number of parameters needed to define such a model is $O(K^D)$, 
where K is the number of states for each variable. 

We can define high dimensional joint distributions using fewer parameters by making con- 
ditional independence assumptions, as we explain in Chapter 10. In the case of continuous 
distributions, an alternative approach is to restrict the form of the pdf to certain functional 
forms, some of which we will examine below. 


Covariance and correlation 


The covariance between two rv’s X and Y measures the degree to which X and Y are (linearly) 
related. Covariance is defined as 


$$\mathrm{cov}[X, Y] \triangleq \mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] \tag{2.65}$$

Figure 2.12 Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note 
that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope 
of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the 
center has a slope of 0 but in that case the correlation coefficient is undefined because the variance of Y 
is zero. Source: http://en.wikipedia.org/wiki/File:Correlation_examples.png 


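If x is a d-dimensional random vector, its covariance matrix is defined to be the following 
symmetric, positive semi-definite matrix: 

$$\mathrm{cov}[\mathbf{x}] \triangleq \mathbb{E}\left[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{x} - \mathbb{E}[\mathbf{x}])^T\right] \tag{2.66}$$

$$= \begin{pmatrix}
\mathrm{var}[X_1] & \mathrm{cov}[X_1, X_2] & \cdots & \mathrm{cov}[X_1, X_d] \\
\mathrm{cov}[X_2, X_1] & \mathrm{var}[X_2] & \cdots & \mathrm{cov}[X_2, X_d] \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{cov}[X_d, X_1] & \mathrm{cov}[X_d, X_2] & \cdots & \mathrm{var}[X_d]
\end{pmatrix} \tag{2.67}$$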


Covariances can be between $-\infty$ and $\infty$. Sometimes it is more convenient to work with a 
normalized measure, with finite bounds. The (Pearson) correlation coefficient between 
X and Y is defined as 

$$\mathrm{corr}[X, Y] = \frac{\mathrm{cov}[X, Y]}{\sqrt{\mathrm{var}[X]\,\mathrm{var}[Y]}} \tag{2.68}$$

A correlation matrix has the form 

$$\mathbf{R} = \begin{pmatrix}
\mathrm{corr}[X_1, X_1] & \mathrm{corr}[X_1, X_2] & \cdots & \mathrm{corr}[X_1, X_d] \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{corr}[X_d, X_1] & \mathrm{corr}[X_d, X_2] & \cdots & \mathrm{corr}[X_d, X_d]
\end{pmatrix} \tag{2.69}$$

One can show (Exercise 4.3) that $-1 \le \mathrm{corr}[X, Y] \le 1$. Hence in a correlation matrix, each 
entry on the diagonal is 1, and the other entries are between −1 and 1. 

One can also show that corr [X,Y] = 1 if and only if Y = aX + b for some parameters a > 0 
and b, i.e., if there is an increasing linear relationship between X and Y (see Exercise 4.4). Intuitively one 
might expect the correlation coefficient to be related to the slope of the regression line, i.e., the 
coefficient a in the expression Y = aX + b. However, as we show in Equation 7.99 later, the 
regression coefficient is in fact given by a = cov [X, Y] /var [X]. A better way to think of the 
correlation coefficient is as a degree of linearity: see Figure 2.12. 

If X and Y are independent, meaning p(X,Y) = p(X)p(Y) (see Section 2.2.4), then 
cov [X,Y] = 0, and hence corr[X,Y] = 0 so they are uncorrelated. However, the converse 
is not true: uncorrelated does not imply independent. For example, let X ~ U(−1,1) and 
$Y = X^2$. Clearly Y is dependent on X (in fact, Y is uniquely determined by X), yet one 
can show (Exercise 4.1) that corr [X,Y] = 0. Some striking examples of this fact are shown in 
Figure 2.12. This shows several data sets where there is clear dependence between X and Y, 
and yet the correlation coefficient is 0. A more general measure of dependence between random 
variables is mutual information, discussed in Section 2.8.3. This is only zero if the variables truly 
are independent. 
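
The uncorrelated-but-dependent example can be illustrated by simulation. This is a small sketch with a hypothetical `corr` helper; the sample size and seed are arbitrary.

```python
import math
import random

def corr(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / math.sqrt(vx * vy)

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(100_000)]  # X ~ U(-1, 1)
print(corr(xs, [x * x for x in xs]))          # near 0, although Y = X^2 is determined by X
print(corr(xs, [2.0 * x + 1.0 for x in xs]))  # near 1 for an increasing linear map
```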


The multivariate Gaussian 


The multivariate Gaussian or multivariate normal (MVN) is the most widely used joint prob- 
ability density function for continuous variables. We discuss MVNs in detail in Chapter 4; here 
we just give some definitions and plots. 

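The pdf of the MVN in D dimensions is defined by the following: 

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \triangleq \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right] \tag{2.70}$$

where $\boldsymbol{\mu} = \mathbb{E}[\mathbf{x}] \in \mathbb{R}^D$ is the mean vector, and $\boldsymbol{\Sigma} = \mathrm{cov}[\mathbf{x}]$ is the D × D covariance 
matrix. Sometimes we will work in terms of the precision matrix or concentration matrix 
instead. This is just the inverse covariance matrix, $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$. The normalization constant 
$(2\pi)^{-D/2} |\boldsymbol{\Lambda}|^{1/2}$ just ensures that the pdf integrates to 1 (see Exercise 4.5). 

Figure 2.13 plots some MVN densities in 2d for three different kinds of covariance matrices. 
A full covariance matrix has D(D + 1)/2 parameters (we divide by 2 since $\boldsymbol{\Sigma}$ is symmetric). A 
diagonal covariance matrix has D parameters, and has 0s in the off-diagonal terms. A spherical 
or isotropic covariance, $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}_D$, has one free parameter. 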
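
To make the MVN definition concrete, here is a minimal sketch that evaluates the density for D = 2 using only the standard library. The function names and test point are illustrative assumptions. It also checks that, for a diagonal covariance, the joint density equals the product of two univariate Gaussians.

```python
import math

def mvn_pdf_2d(x, mu, sigma):
    """Evaluate N(x | mu, Sigma) for D = 2 directly from the definition."""
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21
    # Precision matrix Lambda = Sigma^{-1} (closed form for a 2x2 matrix).
    lam = [[s22 / det, -s12 / det], [-s21 / det, s11 / det]]
    d = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Lambda (x - mu).
    maha = sum(d[i] * lam[i][j] * d[j] for i in range(2) for j in range(2))
    # (2 pi)^{D/2} |Sigma|^{1/2} with D = 2.
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))

def normal_pdf(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

x, mu = [0.3, -1.2], [0.0, 1.0]
diag = [[2.0, 0.0], [0.0, 0.5]]  # diagonal covariance: D = 2 free parameters
print(mvn_pdf_2d(x, mu, diag))
print(normal_pdf(0.3, 0.0, 2.0) * normal_pdf(-1.2, 1.0, 0.5))  # same value
```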


Multivariate Student t distribution 


A more robust alternative to the MVN is the multivariate Student t distribution, whose pdf is 
given by 


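$$\mathcal{T}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu) = \frac{\Gamma(\nu/2 + D/2)}{\Gamma(\nu/2)} \frac{|\boldsymbol{\Sigma}|^{-1/2}}{\nu^{D/2} \pi^{D/2}} \times \left[1 + \frac{1}{\nu} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]^{-\left(\frac{\nu + D}{2}\right)} \tag{2.71}$$

$$= \frac{\Gamma(\nu/2 + D/2)}{\Gamma(\nu/2)} |\pi \mathbf{V}|^{-1/2} \times \left[1 + (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{V}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]^{-\left(\frac{\nu + D}{2}\right)} \tag{2.72}$$

where $\boldsymbol{\Sigma}$ is called the scale matrix (since it is not exactly the covariance matrix) and $\mathbf{V} = \nu\boldsymbol{\Sigma}$. 
This has fatter tails than a Gaussian. The smaller $\nu$ is, the fatter the tails. As $\nu \to \infty$, the 
distribution tends towards a Gaussian. The distribution has the following properties 

$$\text{mean} = \boldsymbol{\mu}, \quad \text{mode} = \boldsymbol{\mu}, \quad \mathrm{Cov} = \frac{\nu}{\nu - 2} \boldsymbol{\Sigma} \tag{2.73}$$

Figure 2.13 We show the level sets for 2d Gaussians. (a) A full covariance matrix has elliptical contours. 
(b) A diagonal covariance matrix is an axis aligned ellipse. (c) A spherical covariance matrix has a circular 
shape. (d) Surface plot for the spherical Gaussian in (c). Figure generated by gaussPlot2Ddemo. 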


Dirichlet distribution 


A multivariate generalization of the beta distribution is the Dirichlet distribution⁹, which has 
support over the probability simplex, defined by 

$$S_K = \left\{ \mathbf{x} : 0 \le x_k \le 1, \sum_{k=1}^K x_k = 1 \right\} \tag{2.74}$$

The pdf is defined as follows: 

$$\mathrm{Dir}(\mathbf{x} \mid \boldsymbol{\alpha}) \triangleq \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^K x_k^{\alpha_k - 1} \mathbb{I}(\mathbf{x} \in S_K) \tag{2.75}$$

where $B(\alpha_1, \ldots, \alpha_K)$ is the natural generalization of the beta function to K variables: 

$$B(\boldsymbol{\alpha}) \triangleq \frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma(\alpha_0)} \tag{2.76}$$

where $\alpha_0 \triangleq \sum_{k=1}^K \alpha_k$. 

9. Johann Dirichlet was a German mathematician, 1805-1859. 

Figure 2.14 (a) The Dirichlet distribution when K = 3 defines a distribution over the simplex, which 
can be represented by the triangular surface. Points on this surface satisfy $0 \le \theta_k \le 1$ and $\sum_{k=1}^3 \theta_k = 1$. 
(b) Plot of the Dirichlet density when $\boldsymbol{\alpha} = (2, 2, 2)$. (c) $\boldsymbol{\alpha} = (20, 2, 2)$. Figure generated by 
visDirichletGui, by Jonathan Huang. (d) $\boldsymbol{\alpha} = (0.1, 0.1, 0.1)$. (The comb-like structure on the edges is 
a plotting artifact.) Figure generated by dirichlet3dPlot. 

Figure 2.15 Samples from a 5-dimensional symmetric Dirichlet distribution for different parameter values. 
(a) $\boldsymbol{\alpha} = (0.1, \ldots, 0.1)$. This results in very sparse distributions, with many 0s. (b) $\boldsymbol{\alpha} = (1, \ldots, 1)$. This 
results in more uniform (and dense) distributions. Figure generated by dirichletHistogramDemo. 

Figure 2.14 shows some plots of the Dirichlet when K = 3, and Figure 2.15 shows some sampled 
probability vectors. We see that $\alpha_0 = \sum_{k=1}^K \alpha_k$ controls the strength of the distribution (how 
peaked it is), and the $\alpha_k$ control where the peak occurs. For example, Dir(1, 1, 1) is a uniform 
distribution, Dir(2, 2, 2) is a broad distribution centered at (1/3, 1/3, 1/3), and Dir(20, 20, 20) 
is a narrow distribution centered at (1/3, 1/3, 1/3). If $\alpha_k < 1$ for all k, we get “spikes” at the 
corners of the simplex. 

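For future reference, the distribution has these properties 

$$\mathbb{E}[x_k] = \frac{\alpha_k}{\alpha_0}, \quad \mathrm{mode}[x_k] = \frac{\alpha_k - 1}{\alpha_0 - K}, \quad \mathrm{var}[x_k] = \frac{\alpha_k(\alpha_0 - \alpha_k)}{\alpha_0^2(\alpha_0 + 1)} \tag{2.77}$$

where $\alpha_0 = \sum_k \alpha_k$. Often we use a symmetric Dirichlet prior of the form $\alpha_k = \alpha/K$. In this 
case, the mean becomes 1/K, and the variance becomes $\mathrm{var}[x_k] = \frac{K-1}{K^2(\alpha+1)}$. So increasing $\alpha$ 
increases the precision (decreases the variance) of the distribution. 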
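
The Dirichlet moments can be compared against samples. The sketch below relies on a standard construction that is not described in the text: normalizing independent Gamma($\alpha_k$, 1) draws yields a Dirichlet sample. The helper names and the choice $\boldsymbol{\alpha} = (20, 2, 2)$ are illustrative.

```python
import random

def sample_dirichlet(alpha, rng):
    """One draw from Dir(alpha), by normalizing independent Gamma(alpha_k, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]

def dirichlet_mean_var(alpha):
    """Closed-form mean and variance of each component (Equation 2.77)."""
    a0 = sum(alpha)
    mean = [a / a0 for a in alpha]
    var = [a * (a0 - a) / (a0**2 * (a0 + 1.0)) for a in alpha]
    return mean, var

rng = random.Random(0)
alpha = [20.0, 2.0, 2.0]
draws = [sample_dirichlet(alpha, rng) for _ in range(20_000)]
emp_mean = [sum(d[k] for d in draws) / len(draws) for k in range(3)]
print("empirical mean:", emp_mean)
print("closed form:   ", dirichlet_mean_var(alpha)[0])
```

Each draw lies on the simplex by construction, so its components sum to 1 up to rounding.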


Transformations of random variables 


If x ~ p() is some random variable, and y = f(x), what is the distribution of y? This is the 
question we address in this section. 
Linear transformations 
Suppose f() is a linear function: 
y = f(x) =Ax+b (2.78) 


In this case, we can easily derive the mean and covariance of y as follows. First, for the mean, 
we have 


$$\mathbb{E}[\mathbf{y}] = \mathbb{E}[\mathbf{A}\mathbf{x} + \mathbf{b}] = \mathbf{A}\boldsymbol{\mu} + \mathbf{b} \tag{2.79}$$

where $\boldsymbol{\mu} = \mathbb{E}[\mathbf{x}]$. This is called the linearity of expectation. If f() is a scalar-valued function, 
$f(\mathbf{x}) = \mathbf{a}^T\mathbf{x} + b$, the corresponding result is 

$$\mathbb{E}[\mathbf{a}^T\mathbf{x} + b] = \mathbf{a}^T\boldsymbol{\mu} + b \tag{2.80}$$

For the covariance, we have 

$$\mathrm{cov}[\mathbf{y}] = \mathrm{cov}[\mathbf{A}\mathbf{x} + \mathbf{b}] = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T \tag{2.81}$$

where $\boldsymbol{\Sigma} = \mathrm{cov}[\mathbf{x}]$. We leave the proof of this as an exercise. If f() is scalar valued, the result 
becomes 

$$\mathrm{var}[y] = \mathrm{var}[\mathbf{a}^T\mathbf{x} + b] = \mathbf{a}^T\boldsymbol{\Sigma}\mathbf{a} \tag{2.82}$$

We will use both of these results extensively in later chapters. Note, however, that the mean 
and covariance only completely define the distribution of y if x is Gaussian. In general we must 
use the techniques described below to derive the full distribution of y, as opposed to just its 
first two moments. 
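
The mean and covariance rules hold for any distribution with finite second moments, not only for Gaussians. The following sketch, with an arbitrarily chosen A, b and a deliberately non-Gaussian x (all of them illustrative assumptions), compares the closed-form moments with Monte Carlo estimates.

```python
import random

def matvec(A, x):
    """Matrix-vector product for lists of lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matmul(A, B):
    """Matrix-matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

A = [[1.0, 2.0], [0.0, 3.0]]
b = [1.0, -1.0]
mu = [0.5, 1.0]                          # E[x] for x1 ~ U(0,1), x2 ~ Exp(1)
Sigma = [[1.0 / 12.0, 0.0], [0.0, 1.0]]  # cov[x]; the two components are independent

mean_y = [m + bi for m, bi in zip(matvec(A, mu), b)]          # A mu + b
cov_y = matmul(matmul(A, Sigma), [list(c) for c in zip(*A)])  # A Sigma A^T

def mc_moments(n, seed=0):
    """Monte Carlo estimate of E[y] and cov[y] for y = A x + b."""
    rng = random.Random(seed)
    ys = []
    for _ in range(n):
        x = [rng.random(), rng.expovariate(1.0)]
        ys.append([v + bi for v, bi in zip(matvec(A, x), b)])
    m = [sum(y[i] for y in ys) / n for i in range(2)]
    c = [[sum((y[i] - m[i]) * (y[j] - m[j]) for y in ys) / n for j in range(2)]
         for i in range(2)]
    return m, c

print("closed form:", mean_y, cov_y)
print("Monte Carlo:", mc_moments(50_000))
```

The two moments match, but the distribution of y here is not Gaussian, so they do not determine it completely.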


General transformations 


If X is a discrete rv, we can derive the pmf for y by simply summing up the probability mass 
for all the x’s such that f(x) = y: 

$$p_y(y) = \sum_{x : f(x) = y} p_x(x) \tag{2.83}$$

For example, if f(X) = 1 if X is even and f(X) = 0 otherwise, and $p_x(X)$ is uniform on the 
set {1,..., 10}, then $p_y(1) = \sum_{x \in \{2,4,6,8,10\}} p_x(x) = 0.5$, and $p_y(0) = 0.5$ similarly. Note 
that in this example, f is a many-to-one function. 
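
The discrete case amounts to accumulating probability mass by output value. A minimal sketch of the even/odd example, with a hypothetical helper name and exact fractions to avoid rounding:

```python
from collections import defaultdict
from fractions import Fraction

def pushforward_pmf(p_x, f):
    """p_y(y) = sum of p_x(x) over all x with f(x) = y."""
    p_y = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for x, px in p_x.items():
        p_y[f(x)] += px
    return dict(p_y)

p_x = {x: Fraction(1, 10) for x in range(1, 11)}  # uniform on {1, ..., 10}
p_y = pushforward_pmf(p_x, lambda x: 1 if x % 2 == 0 else 0)
print(p_y)  # both outcomes receive mass 1/2
```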
If X is continuous, we cannot use Equation 2.83 since $p_x(x)$ is a density, not a pmf, and we 
cannot sum up densities. Instead, we work with cdf’s, and write 

$$P_y(y) \triangleq P(Y \le y) = P(f(X) \le y) = P(X \in \{x \mid f(x) \le y\}) \tag{2.84}$$

We can derive the pdf of y by differentiating the cdf. 
In the case of monotonic (here taken to be increasing) and hence invertible functions, we can write 

$$P_y(y) = P(f(X) \le y) = P(X \le f^{-1}(y)) = P_x(f^{-1}(y)) \tag{2.85}$$

Taking derivatives we get 

$$p_y(y) \triangleq \frac{d}{dy} P_y(y) = \frac{d}{dy} P_x(f^{-1}(y)) = \frac{dx}{dy} \frac{d}{dx} P_x(x) = \frac{dx}{dy} p_x(x) \tag{2.86}$$

where $x = f^{-1}(y)$. We can think of dx as a measure of volume in the x-space; similarly dy 
measures volume in y space. Thus $\frac{dx}{dy}$ measures the change in volume. Since the sign of this 
change is not important, we take the absolute value to get the general expression: 

$$p_y(y) = p_x(x) \left| \frac{dx}{dy} \right| \tag{2.87}$$

This is called the change of variables formula. We can understand this result more intuitively as 
follows. Observations falling in the range $(x, x + \delta x)$ will get transformed into $(y, y + \delta y)$, where 
$p_x(x)\,\delta x \approx p_y(y)\,\delta y$. Hence $p_y(y) \approx p_x(x) \left|\frac{\delta x}{\delta y}\right|$. For example, suppose X ~ U(−1,1), and 
$Y = X^2$. Then $p_y(y) = \frac{1}{2} y^{-1/2}$. See also Exercise 2.10. 
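
The squaring map is not monotonic on (−1, 1), so the density in this example is most easily obtained through the cdf route of Equation 2.84. A short derivation sketch, using only the uniform density $p_x(x) = 1/2$ on (−1, 1):

```latex
\begin{align*}
P_y(y) &= P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y})
        = \int_{-\sqrt{y}}^{\sqrt{y}} \tfrac{1}{2}\, dx = \sqrt{y}, \qquad 0 \le y \le 1, \\
p_y(y) &= \frac{d}{dy} P_y(y) = \tfrac{1}{2}\, y^{-1/2}, \qquad 0 < y \le 1 .
\end{align*}
```

Equivalently, each of the two branches $x = \pm\sqrt{y}$ contributes $p_x(x)\left|\frac{dx}{dy}\right| = \frac{1}{2} \cdot \frac{1}{2\sqrt{y}}$ under Equation 2.87, and the two contributions add up to the same result.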


Multivariate change of variables * 


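We can extend the previous results to multivariate distributions as follows. Let f be a function 
that maps $\mathbb{R}^n$ to $\mathbb{R}^n$, and let y = f(x). Then its Jacobian matrix J is given by 

$$\mathbf{J}_{\mathbf{x} \to \mathbf{y}} \triangleq \frac{\partial(y_1, \ldots, y_n)}{\partial(x_1, \ldots, x_n)} \triangleq \begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_n}
\end{pmatrix} \tag{2.88}$$

$|\det \mathbf{J}|$ measures how much a unit cube changes in volume when we apply f. 
If f is an invertible mapping, we can define the pdf of the transformed variables using the 
Jacobian of the inverse mapping $\mathbf{y} \to \mathbf{x}$: 

$$p_y(\mathbf{y}) = p_x(\mathbf{x}) \left| \det\left(\frac{\partial \mathbf{x}}{\partial \mathbf{y}}\right) \right| = p_x(\mathbf{x}) \left|\det \mathbf{J}_{\mathbf{y} \to \mathbf{x}}\right| \tag{2.89}$$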


In Exercise 4.5 you will use this formula to derive the normalization constant for a multivariate 
Gaussian. 

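As a simple example, consider transforming a density from Cartesian coordinates $\mathbf{x} = (x_1, x_2)$ 
to polar coordinates $\mathbf{y} = (r, \theta)$, where $x_1 = r\cos\theta$ and $x_2 = r\sin\theta$. Then 

$$\mathbf{J}_{\mathbf{y} \to \mathbf{x}} = \begin{pmatrix}
\frac{\partial x_1}{\partial r} & \frac{\partial x_1}{\partial \theta} \\
\frac{\partial x_2}{\partial r} & \frac{\partial x_2}{\partial \theta}
\end{pmatrix} = \begin{pmatrix}
\cos\theta & -r\sin\theta \\
\sin\theta & r\cos\theta
\end{pmatrix} \tag{2.90}$$

and 

$$|\det \mathbf{J}| = |r\cos^2\theta + r\sin^2\theta| = |r| \tag{2.91}$$

Hence 

$$p_y(\mathbf{y}) = p_x(\mathbf{x}) |\det \mathbf{J}| \tag{2.92}$$

$$p_{r,\theta}(r, \theta) = p_{x_1,x_2}(x_1, x_2)\, r = p_{x_1,x_2}(r\cos\theta, r\sin\theta)\, r \tag{2.93}$$

To see this geometrically, notice that the probability mass of the shaded patch in Figure 2.16 is given by 

$$P(r \le R \le r + dr, \theta \le \Theta \le \theta + d\theta) = p_{r,\theta}(r, \theta)\, dr\, d\theta \tag{2.94}$$

In the limit, this is equal to the Cartesian density at the center of the patch, $p_{x_1,x_2}(r\cos\theta, r\sin\theta)$, times the size of the 
patch, $r\, dr\, d\theta$. Hence 

$$p_{r,\theta}(r, \theta)\, dr\, d\theta = p_{x_1,x_2}(r\cos\theta, r\sin\theta)\, r\, dr\, d\theta \tag{2.95}$$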
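
The polar Jacobian can be checked numerically. This sketch uses hypothetical helper names and a central-difference approximation with step h (both assumptions). It estimates the determinant of the map from $(r, \theta)$ to $(x_1, x_2)$, then applies Equation 2.93 to a standard 2d Gaussian.

```python
import math

def to_cartesian(r, theta):
    """x1 = r cos(theta), x2 = r sin(theta)."""
    return (r * math.cos(theta), r * math.sin(theta))

def jacobian_det(r, theta, h=1e-6):
    """Central-difference estimate of det d(x1, x2)/d(r, theta)."""
    xr_p, xr_m = to_cartesian(r + h, theta), to_cartesian(r - h, theta)
    xt_p, xt_m = to_cartesian(r, theta + h), to_cartesian(r, theta - h)
    dx1_dr, dx2_dr = [(p - m) / (2 * h) for p, m in zip(xr_p, xr_m)]
    dx1_dt, dx2_dt = [(p - m) / (2 * h) for p, m in zip(xt_p, xt_m)]
    return dx1_dr * dx2_dt - dx1_dt * dx2_dr

def polar_density(r, theta, p_cartesian):
    """p_{r,theta}(r, theta) = p_{x1,x2}(r cos theta, r sin theta) * r."""
    x1, x2 = to_cartesian(r, theta)
    return p_cartesian(x1, x2) * r

def std_normal_2d(x1, x2):
    """Density of a 2d Gaussian with zero mean and identity covariance."""
    return math.exp(-0.5 * (x1 * x1 + x2 * x2)) / (2.0 * math.pi)

print(jacobian_det(2.0, 0.7))                  # approximately r = 2
print(polar_density(2.0, 0.7, std_normal_2d))  # does not depend on theta for this density
```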


Central limit theorem 


Now consider N random variables with pdf’s (not necessarily Gaussian) $p(x_i)$, each with mean 
$\mu$ and variance $\sigma^2$. We assume each variable is independent and identically distributed 
or iid for short. Let $S_N = \sum_{i=1}^N X_i$ be the sum of the rv’s. This is a simple but widely 
used transformation of rv’s. One can show that, as N increases, the distribution of this sum 
approaches 

$$p(S_N = s) = \frac{1}{\sqrt{2\pi N\sigma^2}} \exp\left(-\frac{(s - N\mu)^2}{2N\sigma^2}\right) \tag{2.96}$$

Hence the distribution of the quantity 

$$Z_N \triangleq \frac{S_N - N\mu}{\sigma\sqrt{N}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{N}} \tag{2.97}$$

converges to the standard normal, where $\bar{X} = \frac{1}{N}\sum_{i=1}^N x_i$ is the sample mean. This is called 
the central limit theorem. See e.g., (Jaynes 2003, p222) or (Rice 1995, p169) for a proof. 

In Figure 2.17 we give an example in which we compute the mean of rv's drawn from a beta 
distribution. We see that the sampling distribution of the mean value rapidly converges to a 
Gaussian distribution. 

Figure 2.16 Change of variables from polar to Cartesian. The area of the shaded patch is $r\, dr\, d\theta$. Based 
on (Rice 1995) Figure 3.16. 

Figure 2.17 The central limit theorem in pictures. We plot a histogram of $\frac{1}{N}\sum_{i=1}^N x_{ij}$, where $x_{ij} \sim \mathrm{Beta}(1, 5)$, 
for j = 1 : 10000. As $N \to \infty$, the distribution tends towards a Gaussian. (a) N = 1. (b) 
N = 5. Based on Figure 2.6 of (Bishop 2006a). Figure generated by centralLimitDemo. 
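
The convergence described by the central limit theorem can be observed by simulation. The sketch below draws the standardized sample mean $Z_N$ for Beta(1, 5) variables and reports its sample skewness, which is 0 for a Gaussian. The helper names, trial count and choice of skewness as the diagnostic are illustrative assumptions.

```python
import math
import random

def standardized_means(N, trials, a=1.0, b=5.0, seed=0):
    """Draw `trials` values of Z_N for iid Beta(a, b) variables."""
    rng = random.Random(seed)
    mu = a / (a + b)
    sigma = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1.0)))
    zs = []
    for _ in range(trials):
        xbar = sum(rng.betavariate(a, b) for _ in range(N)) / N
        zs.append((xbar - mu) / (sigma / math.sqrt(N)))
    return zs

def skewness(zs):
    """Sample skewness (third standardized moment)."""
    n = len(zs)
    m = sum(zs) / n
    s2 = sum((z - m) ** 2 for z in zs) / n
    return sum((z - m) ** 3 for z in zs) / n / s2**1.5

for N in (1, 5, 30):
    zs = standardized_means(N, 5_000)
    print(N, round(skewness(zs), 2))  # skewness shrinks towards 0 as N grows
```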


Monte Carlo approximation 


In general, computing the distribution of a function of an rv using the change of variables 
formula can be difficult. One simple but powerful alternative is as follows. First we generate 
S samples from the distribution, call them $x_1, \ldots, x_S$. (There are many ways to generate such 
samples; one popular method, for high dimensional distributions, is called Markov chain Monte 
Carlo or MCMC; this will be explained in Chapter 24.) Given the samples, we can approximate 
the distribution of f(X) by using the empirical distribution of $\{f(x_s)\}_{s=1}^S$. This is called a 
Monte Carlo approximation, named after a city in Europe known for its plush gambling casinos. 
Monte Carlo techniques were first developed in the area of statistical physics — in particular, 
during development of the atomic bomb — but are now widely used in statistics and machine 
learning as well. 

We can use Monte Carlo to approximate the expected value of any function of a random 
variable. We simply draw samples, and then compute the arithmetic mean of the function 
applied to the samples. This can be written as follows: 

$$\mathbb{E}[f(X)] = \int f(x) p(x)\, dx \approx \frac{1}{S} \sum_{s=1}^S f(x_s) \tag{2.98}$$

where $x_s \sim p(X)$. This is called Monte Carlo integration, and has the advantage over numerical 
integration (which is based on evaluating the function at a fixed grid of points) that the function 
is only evaluated in places where there is non-negligible probability. 

Figure 2.18 Computing the distribution of $y = x^2$, where p(x) is uniform (left). The analytic result is 
shown in the middle, and the Monte Carlo approximation is shown on the right. Figure generated by 
changeOfVarsDemo1d. 

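By varying the function f(), we can approximate many quantities of interest, such as 

• $\bar{x} = \frac{1}{S}\sum_{s=1}^S x_s \to \mathbb{E}[X]$ 
• $\frac{1}{S}\sum_{s=1}^S (x_s - \bar{x})^2 \to \mathrm{var}[X]$ 
• $\frac{1}{S}\#\{x_s \le c\} \to P(X \le c)$ 
• $\mathrm{median}\{x_1, \ldots, x_S\} \to \mathrm{median}(X)$ 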


We give some examples below, and will see many more in later chapters. 
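
The quantities listed above can all be computed from one set of samples. A minimal sketch follows, assuming for illustration that X is Gaussian with mean 1.5 and variance 0.25; the function name and threshold c = 1 are hypothetical.

```python
import random
import statistics

def mc_summaries(xs, c):
    """Monte Carlo estimates of E[X], var[X], P(X <= c) and median(X)."""
    S = len(xs)
    xbar = sum(xs) / S
    return {
        "mean": xbar,
        "var": sum((x - xbar) ** 2 for x in xs) / S,
        "cdf_at_c": sum(x <= c for x in xs) / S,  # (1/S) #{x_s <= c}
        "median": statistics.median(xs),
    }

rng = random.Random(0)
xs = [rng.gauss(1.5, 0.5) for _ in range(50_000)]  # random.gauss takes (mean, std dev)
print(mc_summaries(xs, c=1.0))
```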


Example: change of variables, the MC way 


In Section 2.6.2, we discussed how to analytically compute the distribution of a function of a 
random variable, y = f(x). A much simpler approach is to use a Monte Carlo approximation. 
For example, suppose x ~ Unif(−1,1) and $y = x^2$. We can approximate p(y) by drawing 
many samples from p(x), squaring them, and computing the resulting empirical distribution. 
See Figure 2.18 for an illustration. We will use this technique extensively in later chapters. See 
also Figure 5.2. 

Figure 2.19 Estimating $\pi$ by Monte Carlo integration. Blue points are inside the circle, red crosses are 
outside. Figure generated by mcEstimatePi. 


Example: estimating π by Monte Carlo integration 


MC approximation can be used for many applications, not just statistical ones. Suppose we want 
to estimate $\pi$. We know that the area of a circle with radius r is $\pi r^2$, but it is also equal to the 
following definite integral: 

$$I = \int_{-r}^{r} \int_{-r}^{r} \mathbb{I}(x^2 + y^2 \le r^2)\, dx\, dy \tag{2.99}$$

Hence $\pi = I/(r^2)$. Let us approximate this by Monte Carlo integration. Let 
$f(x, y) = \mathbb{I}(x^2 + y^2 \le r^2)$ be an indicator function that is 1 for points inside the circle, and 0 outside, 
and let p(x) and p(y) be uniform distributions on [−r, r], so p(x) = p(y) = 1/(2r). Then 

$$I = (2r)(2r) \int\!\!\int f(x, y) p(x) p(y)\, dx\, dy \tag{2.100}$$

$$= 4r^2 \int\!\!\int f(x, y) p(x) p(y)\, dx\, dy \tag{2.101}$$

$$\approx 4r^2 \frac{1}{S} \sum_{s=1}^S f(x_s, y_s) \tag{2.102}$$

We find $\hat{\pi} = 3.1416$ with standard error 0.09 (see Section 2.7.3 for a discussion of standard 
errors). We can plot the points that are accepted/rejected as in Figure 2.19. 
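
The estimator in Equation 2.102 takes only a few lines to implement. The following is a sketch rather than the mcEstimatePi script mentioned in the caption; the function name, seed and sample sizes are assumptions, and the standard error uses the fact that f takes only the values 0 and 1.

```python
import math
import random

def estimate_pi(S, r=1.0, seed=0):
    """Monte Carlo estimate of pi and its standard error from S uniform points."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(S):
        x, y = rng.uniform(-r, r), rng.uniform(-r, r)
        inside += (x * x + y * y <= r * r)  # f(x_s, y_s) is 0 or 1
    f_bar = inside / S                      # (1/S) sum_s f(x_s, y_s)
    pi_hat = 4.0 * r**2 * f_bar / r**2      # I / r^2
    # f is a 0/1 variable, so its sample variance is f_bar * (1 - f_bar).
    std_err = 4.0 * math.sqrt(f_bar * (1.0 - f_bar) / S)
    return pi_hat, std_err

for S in (100, 10_000, 100_000):
    print(S, estimate_pi(S))
```

The standard error shrinks like $1/\sqrt{S}$, so each extra digit of accuracy costs roughly 100 times more samples.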


Accuracy of Monte Carlo approximation 


The accuracy of an MC approximation increases with sample size. This is illustrated in Figure 2.20. 
On the top line, we plot a histogram of samples from a Gaussian distribution. On 
the bottom line, we plot a smoothed version of these samples, created using a kernel density 
estimate (Section 14.7.2). This smoothed distribution is then evaluated on a dense grid of points 
and plotted. Note that this smoothing is just for the purposes of plotting; it is not used for the 
Monte Carlo estimate itself. 

Figure 2.20 10 and 100 samples from a Gaussian distribution, $\mathcal{N}(\mu = 1.5, \sigma^2 = 0.25)$. Solid red 
line is true pdf. Top line: histogram of samples. Bottom line: kernel density estimate derived from 
samples in dotted blue, solid red line is true pdf. Based on Figure 4.1 of (Hoff 2009). Figure generated by 
mcAccuracyDemo. 

If we denote the exact mean by $\mu = \mathbb{E}[f(X)]$, and the MC approximation by $\hat{\mu}$, one can 
show that, with independent samples, 

$$(\hat{\mu} - \mu) \to \mathcal{N}\left(0, \frac{\sigma^2}{S}\right) \tag{2.103}$$

where 

$$\sigma^2 = \mathrm{var}[f(X)] = \mathbb{E}[f(X)^2] - \mathbb{E}[f(X)]^2 \tag{2.104}$$

This is a consequence of the central-limit theorem. Of course, $\sigma^2$ is unknown in the above 
expression, but it can also be estimated by MC: 

$$\hat{\sigma}^2 = \frac{1}{S} \sum_{s=1}^S (f(x_s) - \hat{\mu})^2 \tag{2.105}$$

Then we have 

$$P\left\{ \mu - 1.96 \frac{\hat{\sigma}}{\sqrt{S}} \le \hat{\mu} \le \mu + 1.96 \frac{\hat{\sigma}}{\sqrt{S}} \right\} \approx 0.95 \tag{2.106}$$

The term $\sqrt{\hat{\sigma}^2 / S}$ is called the (numerical or empirical) standard error, and is an estimate of our 
uncertainty about our estimate of $\mu$. (See Section 6.2 for more discussion on standard errors.) 
If we want to report an answer which is accurate to within $\pm\epsilon$ with probability at least 95%, 
we need to use a number of samples S which satisfies $1.96\sqrt{\hat{\sigma}^2 / S} \le \epsilon$. We can approximate 
the 1.96 factor by 2, yielding $S \ge \frac{4\hat{\sigma}^2}{\epsilon^2}$. 
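
The standard error and the sample-size rule can be turned into two small helpers. In this sketch the function names are hypothetical, and the test function $f(x) = x^2$ with X a standard normal is an illustrative choice for which $\mathbb{E}[f(X)] = 1$ and $\mathrm{var}[f(X)] = 2$.

```python
import math
import random

def mc_mean_with_se(f, sampler, S, seed=0):
    """Return (mu_hat, standard error) for the MC estimate of E[f(X)]."""
    rng = random.Random(seed)
    vals = [f(sampler(rng)) for _ in range(S)]
    mu_hat = sum(vals) / S
    sigma2_hat = sum((v - mu_hat) ** 2 for v in vals) / S  # MC estimate of var[f(X)]
    return mu_hat, math.sqrt(sigma2_hat / S)

def samples_needed(sigma2_hat, eps):
    """Smallest integer S with S >= 4 * sigma2_hat / eps^2."""
    return math.ceil(4.0 * sigma2_hat / eps**2)

mu_hat, se = mc_mean_with_se(lambda x: x * x, lambda rng: rng.gauss(0.0, 1.0), 20_000)
print(mu_hat, "+/-", 1.96 * se)   # approximate 95% interval for E[X^2]
print(samples_needed(2.0, 0.01))  # samples required for accuracy +/- 0.01
```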


Information theory 


Information theory is concerned with representing data in a compact fashion (a task known as 
data compression or source coding), as well as with transmitting and storing it in a way that 
is robust to errors (a task known as error correction or channel coding). At first, this seems 
far removed from the concerns of probability theory and machine learning, but in fact there is 
an intimate connection. To see this, note that compactly representing data requires allocating 
short codewords to highly probable bit strings, and reserving longer codewords to less probable 
bit strings. This is similar to the situation in natural language, where common words (such as 
“a”, “the”, “and”) are generally much shorter than rare words. Also, decoding messages sent over 
noisy channels requires having a good probability model of the kinds of messages that people 
tend to send. In both cases, we need a model that can predict which kinds of data are likely 
and which unlikely, which is also a central problem in machine learning (see (MacKay 2003) for 
more details on the connection between information theory and machine learning). 

Obviously we cannot go into the details of information theory here (see e.g., (Cover and 
Thomas 2006) if you are interested to learn more). However, we will introduce a few basic 
concepts that we will need later in the book. 


Entropy 


The entropy of a random variable X with distribution p, denoted by H(X) or sometimes 
H (p), is a measure of its uncertainty. In particular, for a discrete variable with K states, it is 


defined by 

$$\mathbb{H}(X) \triangleq -\sum_{k=1}^K p(X = k) \log_2 p(X = k) \tag{2.107}$$

Usually we use log base 2, in which case the units are called bits (short for binary digits). If 
we use log base e, the units are called nats. For example, if $X \in \{1, \ldots, 5\}$ with histogram 
distribution p = [0.25, 0.25, 0.2, 0.15, 0.15], we find H = 2.2855. The discrete distribution with 
maximum entropy is the uniform distribution (see Section 9.2.6 for a proof). Hence for a K-ary 
random variable, the entropy is maximized if p(x = k) = 1/K; in this case, $\mathbb{H}(X) = \log_2 K$. 
Conversely, the distribution with minimum entropy (which is zero) is any delta-function that 
puts all its mass on one state. Such a distribution has no uncertainty. In Figure 2.5(b), where 
we plotted a DNA sequence logo, the height of each bar is defined to be 2 − H, where H is 
the entropy of that distribution, and 2 is the maximum possible entropy. Thus a bar of height 0 
corresponds to a uniform distribution, whereas a bar of height 2 corresponds to a deterministic 
distribution. 

Figure 2.21 Entropy of a Bernoulli random variable as a function of $\theta$. The maximum entropy is 
$\log_2 2 = 1$. Figure generated by bernoulliEntropyFig. 

For the special case of binary random variables, $X \in \{0, 1\}$, we can write $p(X = 1) = \theta$ 
and $p(X = 0) = 1 - \theta$. Hence the entropy becomes 

$$\mathbb{H}(X) = -[p(X = 1) \log_2 p(X = 1) + p(X = 0) \log_2 p(X = 0)] \tag{2.108}$$

$$= -[\theta \log_2 \theta + (1 - \theta) \log_2 (1 - \theta)] \tag{2.109}$$

This is called the binary entropy function, and is also written $\mathbb{H}(\theta)$. We plot this in Figure 2.21. 
We see that the maximum value of 1 occurs when the distribution is uniform, $\theta = 0.5$. 
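
The entropy values quoted in this section are easy to reproduce. A minimal sketch with hypothetical function names, using the usual convention that terms with zero probability contribute nothing:

```python
import math

def entropy(p, base=2.0):
    """H(p) = -sum_k p_k log p_k, using the convention 0 log 0 = 0."""
    return -sum(pk * math.log(pk, base) for pk in p if pk > 0.0)

def binary_entropy(theta):
    """Entropy in bits of a Bernoulli variable with p(X = 1) = theta."""
    return entropy([theta, 1.0 - theta])

print(round(entropy([0.25, 0.25, 0.2, 0.15, 0.15]), 4))  # the histogram example from the text
print(entropy([0.25] * 4))                               # uniform on K = 4 states: log2(4)
print(binary_entropy(0.5), binary_entropy(0.1))          # the maximum is at theta = 0.5
```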


KL divergence 


One way to measure the dissimilarity of two probability distributions, p and q, is known as the 
Kullback-Leibler divergence (KL divergence) or relative entropy. This is defined as follows: 

$$\mathbb{KL}(p \| q) \triangleq \sum_{k=1}^K p_k \log\frac{p_k}{q_k} \tag{2.110}$$

where the sum gets replaced by an integral for pdfs.¹⁰ We can rewrite this as 

$$\mathbb{KL}(p \| q) = \sum_k p_k \log p_k - \sum_k p_k \log q_k = -\mathbb{H}(p) + \mathbb{H}(p, q) \tag{2.111}$$

where $\mathbb{H}(p, q)$ is called the cross entropy, 

$$\mathbb{H}(p, q) \triangleq -\sum_k p_k \log q_k \tag{2.112}$$

One can show (Cover and Thomas 2006) that the cross entropy is the average number of bits 
needed to encode data coming from a source with distribution p when we use model q to 
define our codebook. Hence the “regular” entropy $\mathbb{H}(p) = \mathbb{H}(p, p)$, defined in Section 2.8.1, is 
the expected number of bits if we use the true model, so the KL divergence is the difference 
between these. In other words, the KL divergence is the average number of extra bits needed to 
encode the data, due to the fact that we used distribution q to encode the data instead of the 
true distribution p. 

10. The KL divergence is not a distance, since it is asymmetric. One symmetric version of the KL divergence is the 
Jensen-Shannon divergence, defined as $JS(p_1, p_2) = 0.5\,\mathbb{KL}(p_1 \| q) + 0.5\,\mathbb{KL}(p_2 \| q)$, where $q = 0.5 p_1 + 0.5 p_2$. 

The “extra number of bits” interpretation should make it clear that $\mathbb{KL}(p \| q) \ge 0$, and that 
the KL is only equal to zero iff q = p. We now give a proof of this important result. 


Theorem 2.8.1. (Information inequality) $\mathbb{KL}(p \| q) \ge 0$ with equality iff p = q. 


Proof. To prove the theorem, we need to use Jensen’s inequality. This states that, for any 
convex function f, we have that 

$$f\left(\sum_{i=1}^n \lambda_i x_i\right) \le \sum_{i=1}^n \lambda_i f(x_i) \tag{2.113}$$

where $\lambda_i \ge 0$ and $\sum_{i=1}^n \lambda_i = 1$. This is clearly true for n = 2 (by definition of convexity), and 
can be proved by induction for n > 2. (For a concave function such as log, the inequality is reversed.) 

Let us now prove the main theorem, following (Cover and Thomas 2006, p28). Let 
$A = \{x : p(x) > 0\}$ be the support of p(x). Then 

$$-\mathbb{KL}(p \| q) = -\sum_{x \in A} p(x) \log\frac{p(x)}{q(x)} = \sum_{x \in A} p(x) \log\frac{q(x)}{p(x)} \tag{2.114}$$

$$\le \log \sum_{x \in A} p(x) \frac{q(x)}{p(x)} = \log \sum_{x \in A} q(x) \tag{2.115}$$

$$\le \log \sum_{x \in \mathcal{X}} q(x) = \log 1 = 0 \tag{2.116}$$

where the first inequality follows from Jensen’s. Since log(x) is a strictly concave function, we 
have equality in Equation 2.115 iff p(x) = cq(x) for some c. We have equality in Equation 2.116 
iff $\sum_{x \in A} q(x) = \sum_{x \in \mathcal{X}} q(x) = 1$, which implies c = 1. Hence $\mathbb{KL}(p \| q) = 0$ iff p(x) = q(x) 
for all x. 


One important consequence of this result is that the discrete distribution with the maximum 
entropy is the uniform distribution. More precisely, $\mathbb{H}(X) \le \log|\mathcal{X}|$, where $|\mathcal{X}|$ is the number 
of states for X, with equality iff p(x) is uniform. To see this, let $u(x) = 1/|\mathcal{X}|$. Then 

$$0 \le \mathbb{KL}(p \| u) = \sum_x p(x) \log\frac{p(x)}{u(x)} \tag{2.117}$$

$$= \sum_x p(x) \log p(x) - \sum_x p(x) \log u(x) = -\mathbb{H}(X) + \log|\mathcal{X}| \tag{2.118}$$

This is a formulation of Laplace’s principle of insufficient reason, which argues in favor of 
using uniform distributions when there are no other reasons to favor one distribution over 
another. See Section 9.2.6 for a discussion of how to create distributions that satisfy certain 
constraints, but otherwise are as least-committal as possible. (For example, the Gaussian satisfies 
first and second moment constraints, but otherwise has maximum entropy.) 
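
The identities relating KL divergence, cross entropy and entropy can be verified numerically. The sketch below uses two arbitrary illustrative distributions p and q and assumes $q_k > 0$ wherever $p_k > 0$, since the KL divergence is infinite otherwise.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log2 q_k (terms with p_k = 0 contribute 0)."""
    return -sum(pk * math.log2(qk) for pk, qk in zip(p, q) if pk > 0.0)

def entropy(p):
    """H(p) = H(p, p)."""
    return cross_entropy(p, p)

def kl(p, q):
    """KL(p || q) = sum_k p_k log2(p_k / q_k), in bits; assumes q_k > 0 where p_k > 0."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)

p = [0.5, 0.25, 0.25]
q = [0.1, 0.1, 0.8]
u = [1.0 / 3.0] * 3  # uniform distribution on 3 states
print(kl(p, q), cross_entropy(p, q) - entropy(p))  # the two expressions agree
print(kl(p, q), kl(q, p))                          # KL is not symmetric in general
print(kl(p, u), math.log2(3) - entropy(p))         # KL(p || u) = log|X| - H(p)
```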




Mutual information 


Consider two random variables, X and Y. Suppose we want to know how much knowing one 
variable tells us about the other. We could compute the correlation coefficient, but this is only 
defined for real-valued random variables, and furthermore, this is a very limited measure of 
dependence, as we saw in Figure 2.12. A more general approach is to determine how similar the 
joint distribution p(X, Y) is to the factored distribution p(X)p(Y). This is called the mutual 
information or MI, and is defined as follows: 

$$\mathbb{I}(X; Y) \triangleq \mathbb{KL}(p(X, Y) \| p(X)p(Y)) = \sum_x \sum_y p(x, y) \log\frac{p(x, y)}{p(x)p(y)} \tag{2.119}$$

We have $\mathbb{I}(X; Y) \ge 0$ with equality iff p(X,Y) = p(X)p(Y). That is, the MI is zero iff the 
variables are independent. 

To gain insight into the meaning of MI, it helps to re-express it in terms of joint and conditional 
entropies. One can show (Exercise 2.12) that the above expression is equivalent to the following: 

$$\mathbb{I}(X; Y) = \mathbb{H}(X) - \mathbb{H}(X|Y) = \mathbb{H}(Y) - \mathbb{H}(Y|X) \tag{2.120}$$

where $\mathbb{H}(Y|X)$ is the conditional entropy, defined as $\mathbb{H}(Y|X) = \sum_x p(x)\,\mathbb{H}(Y|X = x)$. 
Thus we can interpret the MI between X and Y as the reduction in uncertainty about X after 
observing Y, or, by symmetry, the reduction in uncertainty about Y after observing X. We will 
encounter several applications of MI later in the book. See also Exercises 2.13 and 2.14 for the 
connection between MI and correlation coefficients. 

A quantity which is closely related to MI is the pointwise mutual information or PMI. For 
two events (not random variables) x and y, this is defined as 

$$\mathrm{PMI}(x, y) \triangleq \log\frac{p(x, y)}{p(x)p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)} \tag{2.121}$$

This measures the discrepancy between these events occurring together compared to what would 
be expected by chance. Clearly the MI of X and Y is just the expected value of the PMI. 
Interestingly, we can rewrite the PMI as follows: 

$$\mathrm{PMI}(x, y) = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)} \tag{2.122}$$

This is the amount we learn from updating the prior p(x) into the posterior p(x|y), or equivalently, 
updating the prior p(y) into the posterior p(y|x). 
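
For discrete variables, the MI can be computed from a joint probability table as the expected PMI. The sketch below also checks that it agrees with the entropy form $\mathbb{H}(X) - \mathbb{H}(X|Y)$. The two 2 × 2 tables and the helper names are illustrative assumptions.

```python
import math

def entropy(p):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0.0)

def mutual_information(joint):
    """I(X; Y) in bits from a joint table joint[x][y] = p(x, y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))  # p(x, y) * PMI(x, y)
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0.0
    )

def conditional_entropy_x_given_y(joint):
    """H(X | Y) = sum_y p(y) H(X | Y = y)."""
    total = 0.0
    for col in zip(*joint):
        py = sum(col)
        if py > 0.0:
            total += py * entropy([pxy / py for pxy in col])
    return total

joint = [[0.4, 0.1], [0.1, 0.4]]      # dependent binary variables
indep = [[0.18, 0.42], [0.12, 0.28]]  # outer product of [0.6, 0.4] and [0.3, 0.7]
px = [sum(row) for row in joint]
print(mutual_information(joint), entropy(px) - conditional_entropy_x_given_y(joint))
print(mutual_information(indep))      # zero up to rounding error
```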


Mutual information for continuous random variables * 


The above formula for MI is defined for discrete random variables. For continuous random 
variables, it is common to first discretize or quantize them, by dividing the ranges of each 
variable into bins, and computing how many values fall in each histogram bin (Scott 1979). We 
can then easily compute the MI using the formula above (see mutualInfoAllPairsMixed for 
some code, and miMixedDemo for a demo). 

Unfortunately, the number of bins used, and the location of the bin boundaries, can have 
a significant effect on the results. One way around this is to try to estimate the MI directly, 
without first performing density estimation (Learned-Miller 2004). Another approach is to try 
many different bin sizes and locations, and to compute the maximum MI achieved. This 
statistic, appropriately normalized, is known as the maximal information coefficient (MIC) 
(Reshed et al. 2011). More precisely, define 

$$m(x, y) = \frac{\max_{G \in \mathcal{G}(x, y)} \mathbb{I}(X(G); Y(G))}{\log \min(x, y)} \tag{2.123}$$

where $\mathcal{G}(x, y)$ is the set of 2d grids of size x × y, and X(G), Y(G) represents a discretization of 
the variables onto this grid. (The maximization over bin locations can be performed efficiently 
using dynamic programming (Reshed et al. 2011).) Now define the MIC as 

$$\mathrm{MIC} \triangleq \max_{x, y : xy < B} m(x, y) \tag{2.124}$$

where B is some sample-size dependent bound on the number of bins we can use and still 
reliably estimate the distribution ((Reshed et al. 2011) suggest $B = N^{0.6}$). It can be shown that 
the MIC lies in the range [0, 1], where 0 represents no relationship between the variables, and 1 
represents a noise-free relationship of any form, not just linear. 

Figure 2.22 Left: Correlation coefficient vs maximal information criterion (MIC) for all pairwise relationships 
in the WHO data. Right: scatter plots of certain pairs of variables. The red lines are non-parametric 
smoothing regressions (Section 15.4.6) fit separately to each trend. Source: Figure 4 of (Reshed et al. 2011). 
Used with kind permission of David Reshef and the American Association for the Advancement of Science. 

Figure 2.22 gives an example of this statistic in action. The data consists of 357 variables 
measuring a variety of social, economic, health and political indicators, collected by the World 
Health Organization (WHO). On the left of the figure, we see the correlation coefficient (CC) 
plotted against the MIC for all 63,566 variable pairs. On the right of the figure, we see scatter 
plots for particular pairs of variables, which we now discuss: 


• The point marked C has a low CC and a low MIC. The corresponding scatter plot makes it 
clear that there is no relationship between these two variables (percentage of lives lost to 
injury and density of dentists in the population). 

• The points marked D and H have high CC (in absolute value) and high MIC, because they 
represent nearly linear relationships. 

• The points marked E, F, and G have low CC but high MIC. This is because they correspond 
to non-linear (and sometimes, as in the case of E and F, non-functional, i.e., one-to-many) 
relationships between the variables. 


In summary, we see that statistics (such as MIC) based on mutual information can be used 
to discover interesting relationships between variables in a way that simpler measures, such as 
correlation coefficients, cannot. For this reason, the MIC has been called “a correlation for the 
21st century” (Speed 2011). 


Exercises 


Exercise 2.1 Probabilities are sensitive to the form of the question that was used to generate the answer 


(Source: Minka.) My neighbor has two children. Assuming that the gender of a child is like a coin flip, 
it is most likely, a priori, that my neighbor has one boy and one girl, with probability 1/2. The other 
possibilities—two boys or two girls—have probabilities 1/4 and 1/4. 


a. Suppose I ask him whether he has any boys, and he says yes. What is the probability that one child is 
a girl? 

b. Suppose instead that I happen to see one of his children run by, and it is a boy. What is the probability 
that the other child is a girl? 


Exercise 2.2 Legal reasoning 


(Source: Peter Lee.) Suppose a crime has been committed. Blood is found at the scene for which there is 
no innocent explanation. It is of a type which is present in 1% of the population. 


a. The prosecutor claims: “There is a 1% chance that the defendant would have the crime blood type if he 
were innocent. Thus there is a 99% chance that he is guilty”. This is known as the prosecutor’s fallacy. 
What is wrong with this argument? 


b. The defender claims: “The crime occurred in a city of 800,000 people. The blood type would be 
found in approximately 8000 people. The evidence has provided a probability of just 1 in 8000 that 
the defendant is guilty, and thus has no relevance.” This is known as the defender’s fallacy. What is 
wrong with this argument? 


Exercise 2.3 Variance of a sum 


Show that the variance of a sum is var [X + Y] = var [X] + var [Y] + 2cov [X,Y], where cov [X,Y] 
is the covariance between X and Y. 


Exercise 2.4 Bayes rule for medical diagnosis 


(Source: Koller.) After your yearly checkup, the doctor has bad news and good news. The bad news is that 
you tested positive for a serious disease, and that the test is 99% accurate (i.e., the probability of testing 
positive given that you have the disease is 0.99, as is the probability of testing negative given that you don't 
have the disease). The good news is that this is a rare disease, striking only one in 10,000 people. What are 
the chances that you actually have the disease? (Show your calculations as well as giving the final result.) 


Exercise 2.5 The Monty Hall problem 


(Source: MacKay.) On a game show, a contestant is told the rules as follows: 


There are three doors, labelled 1, 2, 3. A single prize has been hidden behind one of them. You 
get to select one door. Initially your chosen door will not be opened. Instead, the gameshow host 
will open one of the other two doors, and he will do so in such a way as not to reveal the prize. For 
example, if you first choose door 1, he will then open one of doors 2 and 3, and it is guaranteed 
that he will choose which one to open so that the prize will not be revealed. 


At this point, you will be given a fresh choice of door: you can either stick with your first choice, 
or you can switch to the other closed door. All the doors will then be opened and you will receive 
whatever is behind your final choice of door. 


Imagine that the contestant chooses door 1 first; then the gameshow host opens door 3, revealing nothing 
behind the door, as promised. Should the contestant (a) stick with door 1, or (b) switch to door 2, or (c) 
does it make no difference? You may assume that initially, the prize is equally likely to be behind any of 
the 3 doors. Hint: use Bayes rule. 

Exercise 2.6 Conditional independence 

(Source: Koller.) 


a. Let $H \in \{1, \ldots, K\}$ be a discrete random variable, and let $e_1$ and $e_2$ be the observed values of two 
other random variables $E_1$ and $E_2$. Suppose we wish to calculate the vector 

$$P(H|e_1, e_2) = (P(H = 1|e_1, e_2), \ldots, P(H = K|e_1, e_2))$$

Which of the following sets of numbers are sufficient for the calculation? 
i. $P(e_1, e_2)$, $P(H)$, $P(e_1|H)$, $P(e_2|H)$ 
ii. $P(e_1, e_2)$, $P(H)$, $P(e_1, e_2|H)$ 
iii. $P(e_1|H)$, $P(e_2|H)$, $P(H)$ 
b. Now suppose we assume $E_1 \perp E_2 | H$ (i.e., $E_1$ and $E_2$ are conditionally independent given H). 
Which of the above 3 sets are sufficient now? 
Show your calculations as well as giving the final result. Hint: use Bayes rule. 


Exercise 2.7 Pairwise independence does not imply mutual independence 


We say that two random variables are pairwise independent if 

$$p(X_2|X_1) = p(X_2) \tag{2.125}$$

and hence 

$$p(X_2, X_1) = p(X_1)p(X_2|X_1) = p(X_1)p(X_2) \tag{2.126}$$

We say that n random variables are mutually independent if 

$$p(X_i|X_S) = p(X_i) \quad \forall S \subseteq \{1, \ldots, n\} \setminus \{i\} \tag{2.127}$$

and hence 

$$p(X_{1:n}) = \prod_{i=1}^{n} p(X_i) \tag{2.128}$$


Show that pairwise independence between all pairs of variables does not necessarily imply mutual 
independence. It suffices to give a counterexample. 


Exercise 2.8 Conditional independence iff joint factorizes 

In the text we said $X \perp Y | Z$ iff 

$$p(x, y|z) = p(x|z)p(y|z) \tag{2.129}$$

for all x, y, z such that p(z) > 0. Now prove the following alternative definition: $X \perp Y | Z$ iff there exist 
functions g and h such that 

$$p(x, y|z) = g(x, z)h(y, z) \tag{2.130}$$

for all x, y, z such that p(z) > 0. 


Exercise 2.9 Conditional independence 


(Source: Koller.) Are the following properties true? Prove or disprove. Note that we are not restricting 
attention to distributions that can be represented by a graphical model. 


a. True or false? $(X \perp W | Z, Y) \wedge (X \perp Y | Z) \Rightarrow (X \perp Y, W | Z)$ 
b. True or false? $(X \perp Y | Z) \wedge (X \perp Y | W) \Rightarrow (X \perp Y | Z, W)$ 


Exercise 2.10 Deriving the inverse gamma density 
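Let X ~ Ga(a, b), i.e., 

$$\text{Ga}(x|a, b) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-xb} \tag{2.131}$$

Let Y = 1/X. Show that Y ~ IG(a, b), i.e., 

$$\text{IG}(x|\text{shape} = a, \text{scale} = b) = \frac{b^a}{\Gamma(a)} x^{-(a+1)} e^{-b/x} \tag{2.132}$$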


Hint: use the change of variables formula. 


Exercise 2.11 Normalization constant for a 1D Gaussian 


The normalization constant for a zero-mean Gaussian is given by 

$$Z = \int_a^b \exp\left(-\frac{x^2}{2\sigma^2}\right) dx \tag{2.133}$$

where $a = -\infty$ and $b = \infty$. To compute this, consider its square 

$$Z^2 = \int_a^b \int_a^b \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) dx\, dy \tag{2.134}$$

Let us change variables from cartesian (x, y) to polar $(r, \theta)$ using $x = r\cos\theta$ and $y = r\sin\theta$. Since 
$dx\, dy = r\, dr\, d\theta$, and $\cos^2\theta + \sin^2\theta = 1$, we have 

$$Z^2 = \int_0^{2\pi} \int_0^{\infty} r \exp\left(-\frac{r^2}{2\sigma^2}\right) dr\, d\theta \tag{2.135}$$

Evaluate this integral and hence show $Z = \sigma\sqrt{2\pi}$. Hint 1: separate the integral into a product of 
two terms, the first of which (involving $d\theta$) is constant, so is easy. Hint 2: if $u = e^{-r^2/2\sigma^2}$ then 
$du/dr = -\frac{1}{\sigma^2} r e^{-r^2/2\sigma^2}$, so the second integral is also easy (since $\int u'(r)\, dr = u(r)$). 


Exercise 2.12 Expressing mutual information in terms of entropies 

Show that 

$$I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) \tag{2.136}$$


Exercise 2.13 Mutual information for correlated normals 


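(Source: (Cover and Thomas 1991, Q9.3).) Find the mutual information $I(X_1, X_2)$ where X has a bivariate 
normal distribution: 

$$\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix}\right) \tag{2.137}$$

Evaluate $I(X_1, X_2)$ for $\rho = 1$, $\rho = 0$ and $\rho = -1$ and comment. Hint: The (differential) entropy of a 
d-dimensional Gaussian is 

$$h(X) = \frac{1}{2} \log_2 \left[(2\pi e)^d \det \Sigma\right] \tag{2.138}$$

In the 1d case, this becomes 

$$h(X) = \frac{1}{2} \log_2 \left[2\pi e \sigma^2\right] \tag{2.139}$$

Hint: $\log(0) = -\infty$. 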


Exercise 2.14 A measure of correlation (normalized mutual information) 


(Source: (Cover and Thomas 1991, Q2.20).) Let X and Y be discrete random variables which are identically 
distributed (so H(X) = H(Y)) but not necessarily independent. Define 

$$r = 1 - \frac{H(Y|X)}{H(X)}$$

a. Show $r = \frac{I(X, Y)}{H(X)}$ 
b. Show $0 \le r \le 1$ 
c. When is r = 0? 
d. When is r = 1? 


Exercise 2.15 MLE minimizes KL divergence to the empirical distribution 
Let $p_{\text{emp}}(x)$ be the empirical distribution, and let $q(x|\theta)$ be some model. Show that $\operatorname{argmin}_q \text{KL}(p_{\text{emp}} \| q)$ 
is obtained by $q(x) = q(x; \hat{\theta})$, where $\hat{\theta}$ is the MLE. Hint: use non-negativity of the KL divergence. 


Exercise 2.16 Mean, mode, variance for the beta distribution 


Suppose $\theta \sim \text{Beta}(a, b)$. Derive the mean, mode and variance. 


Exercise 2.17 Expected value of the minimum 


Suppose X, Y are two points sampled independently and uniformly at random from the interval [0, 1]. 
What is the expected location of the leftmost point? 


Generative models for discrete data 


Introduction 


In Section 2.2.3.2, we discussed how to classify a feature vector x by applying Bayes rule to a 
generative classifier of the form 

$$p(y = c|x, \theta) \propto p(x|y = c, \theta)\, p(y = c|\theta) \tag{3.1}$$

The key to using such models is specifying a suitable form for the class-conditional density 
$p(x|y = c, \theta)$, which defines what kind of data we expect to see in each class. In this chapter, 
we focus on the case where the observed data are discrete symbols. We also discuss how to 
infer the unknown parameters $\theta$ of such models. 


Bayesian concept learning 


Consider how a child learns to understand the meaning of a word, such as “dog”. Presumably 
the child’s parents point out positive examples of this concept, saying such things as, “look at 
the cute dog!”, or “mind the doggy”, etc. However, it is very unlikely that they provide negative 
examples, by saying “look at that non-dog”. Certainly, negative examples may be obtained during 
an active learning process — the child says “look at the dog” and the parent says “that’s a cat, 
dear, not a dog” — but psychological research has shown that people can learn concepts from 
positive examples alone (Xu and Tenenbaum 2007). 

We can think of learning the meaning of a word as equivalent to concept learning, which in 
turn is equivalent to binary classification. To see this, define f(x) = 1 if x is an example of the 
concept C, and f(x) = 0 otherwise. Then the goal is to learn the indicator function f, which 
just defines which elements are in the set C. By allowing for uncertainty about the definition 
of f, or equivalently the elements of C, we can emulate fuzzy set theory, but using standard 
probability calculus. Note that standard binary classification techniques require positive and 
negative examples. By contrast, we will devise a way to learn from positive examples alone. 

For pedagogical purposes, we will consider a very simple example of concept learning called 
the number game, based on part of Josh Tenenbaum’s PhD thesis (Tenenbaum 1999). The game 
proceeds as follows. I choose some simple arithmetical concept C, such as “prime number” or 
“a number between 1 and 10”. I then give you a series of randomly chosen positive examples 
$D = \{x_1, \ldots, x_N\}$ drawn from C, and ask you whether some new test case $\tilde{x}$ belongs to C, 
i.e., I ask you to classify $\tilde{x}$. 

Figure 3.1 Empirical predictive distribution averaged over 8 humans in the number game. First two 
rows: after seeing D = {16} and D = {60}. This illustrates diffuse similarity. Third row: after 
seeing D = {16,8,2,64}. This illustrates rule-like behavior (powers of 2). Bottom row: after seeing 
D = {16,23,19,20}. This illustrates focussed similarity (numbers near 20). Source: Figure 5.5 of 
(Tenenbaum 1999). Used with kind permission of Josh Tenenbaum. 


Suppose, for simplicity, that all numbers are integers between 1 and 100. Now suppose I tell 
you “16” is a positive example of the concept. What other numbers do you think are positive? 
17? 6? 32? 99? It’s hard to tell with only one example, so your predictions will be quite vague. 
Presumably numbers that are similar in some sense to 16 are more likely. But similar in what 
way? 17 is similar, because it is “close by”, 6 is similar because it has a digit in common, 
32 is similar because it is also even and a power of 2, but 99 does not seem similar. Thus 
some numbers are more likely than others. We can represent this as a probability distribution, 
$p(\tilde{x}|D)$, which is the probability that $\tilde{x} \in C$ given the data D for any $\tilde{x} \in \{1, \ldots, 100\}$. This 
is called the posterior predictive distribution. Figure 3.1(top) shows the predictive distribution 
of people derived from a lab experiment. We see that people predict numbers that are similar 
to 16, under a variety of kinds of similarity. 

Now suppose I tell you that 8, 2 and 64 are also positive examples. Now you may guess that 
the hidden concept is “powers of two”. This is an example of induction. Given this hypothesis, 
the predictive distribution is quite specific, and puts most of its mass on powers of 2, as shown 
in Figure 3.1(third row). If instead I tell you the data is D = {16,23,19, 20}, you will get a 
different kind of generalization gradient, as shown in Figure 3.1(bottom). 

How can we explain this behavior and emulate it in a machine? The classic approach to 
induction is to suppose we have a hypothesis space of concepts, H, such as: odd numbers, 
even numbers, all numbers between 1 and 100, powers of two, all numbers ending in j (for 
$0 \le j \le 9$), etc. The subset of H that is consistent with the data D is called the version space. 
As we see more examples, the version space shrinks and we become increasingly certain about 
the concept (Mitchell 1997). 
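
To make the notion of a version space concrete, here is a minimal sketch. The hypothesis space below is a small illustrative subset chosen for this example (an assumption, not the full space used in the text), and `version_space` is a hypothetical helper name.

```python
NUMS = range(1, 101)

# Illustrative hypothesis space: each concept is represented by its extension.
HYPOTHESES = {
    "odd": {n for n in NUMS if n % 2 == 1},
    "even": {n for n in NUMS if n % 2 == 0},
    "all": set(NUMS),
    "powers of two": {2 ** k for k in range(1, 7)},
}
HYPOTHESES.update(
    {f"ends in {j}": {n for n in NUMS if n % 10 == j} for j in range(10)}
)

def version_space(data):
    """Return the names of all hypotheses whose extension contains every example."""
    return {name for name, ext in HYPOTHESES.items()
            if all(x in ext for x in data)}

print(sorted(version_space([16])))
print(sorted(version_space([16, 8, 2, 64])))
```

With one example, four of these hypotheses survive; after four examples, "ends in 6" is ruled out and the version space shrinks to three.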

However, the version space is not the whole story. After seeing D = {16}, there are many 
consistent rules; how do you combine them to predict if $\tilde{x} \in C$? Also, after seeing D = 
{16, 8, 2,64}, why did you choose the rule “powers of two” and not, say, “all even numbers”, or 
“powers of two except for 32”, both of which are equally consistent with the evidence? We will 
now provide a Bayesian explanation for this. 


Likelihood 


We must explain why we chose $h_{\text{two}}$ = “powers of two”, and not, say, $h_{\text{even}}$ = “even numbers” 
after seeing D = {16,8,2,64}, given that both hypotheses are consistent with the evidence. 
The key intuition is that we want to avoid suspicious coincidences. If the true concept was 
even numbers, how come we only saw numbers that happened to be powers of two? 

To formalize this, let us assume that examples are sampled uniformly at random from the 
extension of a concept. (The extension of a concept is just the set of numbers that belong 
to it, e.g., the extension of $h_{\text{even}}$ is {2,4,6,...,98,100}; the extension of “numbers ending 
in 9” is {9,19,...,99}.) Tenenbaum calls this the strong sampling assumption. Given this 
assumption, the probability of independently sampling N items (with replacement) from h is 
given by 

$$p(D|h) = \left[\frac{1}{\text{size}(h)}\right]^N = \left[\frac{1}{|h|}\right]^N \tag{3.2}$$


This crucial equation embodies what Tenenbaum calls the size principle, which means the 
model favors the simplest (smallest) hypothesis consistent with the data. This is more commonly 
known as Occam’s razor.¹ 

To see how it works, let D = {16}. Then $p(D|h_{\text{two}}) = 1/6$, since there are only 6 powers 
of two less than 100, but $p(D|h_{\text{even}}) = 1/50$, since there are 50 even numbers. So the 
likelihood that $h = h_{\text{two}}$ is higher than if $h = h_{\text{even}}$. After 4 examples, the likelihood of $h_{\text{two}}$ 
is $(1/6)^4 = 7.7 \times 10^{-4}$, whereas the likelihood of $h_{\text{even}}$ is $(1/50)^4 = 1.6 \times 10^{-7}$. This is 
a likelihood ratio of almost 5000:1 in favor of $h_{\text{two}}$. This quantifies our earlier intuition that 
D = {16,8,2,64} would be a very suspicious coincidence if generated by $h_{\text{even}}$. 
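
The size principle can be checked numerically. The sketch below assumes the strong sampling likelihood from the text, with each hypothesis represented by its extension over the integers 1 to 100; the function name `likelihood` is hypothetical.

```python
def likelihood(data, extension):
    """p(D | h) under strong sampling: (1/|h|)^N if every example lies in h, else 0."""
    if not all(x in extension for x in data):
        return 0.0
    return (1.0 / len(extension)) ** len(data)

h_two = {2 ** k for k in range(1, 7)}   # {2, 4, 8, 16, 32, 64}
h_even = set(range(2, 101, 2))          # the 50 even numbers up to 100

data = [16, 8, 2, 64]
lik_two = likelihood(data, h_two)
lik_even = likelihood(data, h_even)
print(lik_two, lik_even, lik_two / lik_even)
```

The printed ratio equals (50/6)^4, which is the "almost 5000:1" figure quoted above.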


Prior 


Suppose D = {16,8,2,64}. Given this data, the concept h′ = “powers of two except 32” is 
more likely than h = “powers of two”, since h′ does not need to explain the coincidence that 32 
is missing from the set of examples. 

However, the hypothesis h′ = “powers of two except 32” seems “conceptually unnatural”. We 
can capture such intuition by assigning low prior probability to unnatural concepts. Of course, 
your prior might be different than mine. This subjective aspect of Bayesian reasoning is a 
source of much controversy, since it means, for example, that a child and a math professor 
will reach different answers. In fact, they presumably not only have different priors, but also 
different hypothesis spaces. However, we can finesse that by defining the hypothesis space of 
the child and the math professor to be the same, and then setting the child’s prior weight to be 
zero on certain “advanced” concepts. Thus there is no sharp distinction between the prior and 
the hypothesis space. 

1. William of Occam (also spelt Ockham) was an English monk and philosopher, 1288-1348. 

Although the subjectivity of the prior is controversial, it is actually quite useful. If you are 
told the numbers are from some arithmetic rule, then given 1200, 1500, 900 and 1400, you may 
think 400 is likely but 1183 is unlikely. But if you are told that the numbers are examples of 
healthy cholesterol levels, you would probably think 400 is unlikely and 1183 is likely. Thus we 
see that the prior is the mechanism by which background knowledge can be brought to bear on 
a problem. Without this, rapid learning (i.e., from small sample sizes) is impossible. 

So, what prior should we use? For illustration purposes, let us use a simple prior which 
puts uniform probability on 30 simple arithmetical concepts, such as “even numbers”, “odd 
numbers”, “prime numbers”, “numbers ending in 9”, etc. To make things more interesting, we 
make the concepts even and odd more likely a priori. We also include two “unnatural” concepts, 
namely “powers of 2, plus 37” and “powers of 2, except 32”, but give them low prior weight. See 
Figure 3.2(a) for a plot of this prior. We will consider a slightly more sophisticated prior later on. 


Posterior 
The posterior is simply the likelihood times the prior, normalized. In this context we have 

$$p(h|D) = \frac{p(D|h)p(h)}{\sum_{h' \in \mathcal{H}} p(D, h')} = \frac{p(h)\,\mathbb{I}(D \in h)/|h|^N}{\sum_{h' \in \mathcal{H}} p(h')\,\mathbb{I}(D \in h')/|h'|^N} \tag{3.3}$$

where $\mathbb{I}(D \in h)$ is 1 iff (if and only if) all the data are in the extension of the hypothesis 
h. Figure 3.2 plots the prior, likelihood and posterior after seeing D = {16}. We see that the 
posterior is a combination of prior and likelihood. In the case of most of the concepts, the prior 
is uniform, so the posterior is proportional to the likelihood. However, the “unnatural” concepts 
of “powers of 2, plus 37” and “powers of 2, except 32” have low posterior support, despite having 
high likelihood, due to the low prior. Conversely, the concept of odd numbers has low posterior 
support, despite having a high prior, due to the low likelihood. 

Figure 3.3 plots the prior, likelihood and posterior after seeing D = {16,8,2,64}. Now the 
likelihood is much more peaked on the powers of two concept, so this dominates the posterior. 
Essentially the learner has an aha moment, and figures out the true concept. (Here we see the 
need for the low prior on the unnatural concepts, otherwise we would have overfit the data and 
picked “powers of 2, except for 32”.) 

In general, when we have enough data, the posterior p(h|D) becomes peaked on a single 
concept, namely the MAP estimate, i.e., 


$$p(h|D) \rightarrow \delta_{\hat{h}^{MAP}}(h) \tag{3.4}$$

where $\hat{h}^{MAP} = \operatorname{argmax}_h p(h|D)$ is the posterior mode, and where $\delta$ is the Dirac measure 
defined by 

$$\delta_x(A) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases} \tag{3.5}$$

Figure 3.2 Prior, likelihood and posterior for D = {16}. Based on (Tenenbaum 1999). Figure generated 
by numbersGame. 


Note that the MAP estimate can be written as 

$$\hat{h}^{MAP} = \operatorname*{argmax}_h\, p(D|h)p(h) = \operatorname*{argmax}_h\, [\log p(D|h) + \log p(h)] \tag{3.6}$$

Since the likelihood term depends exponentially on N, and the prior stays constant, as we get 
more and more data, the MAP estimate converges towards the maximum likelihood estimate 
or MLE: 

$$\hat{h}^{mle} \triangleq \operatorname*{argmax}_h\, p(D|h) = \operatorname*{argmax}_h\, \log p(D|h) \tag{3.7}$$


In other words, if we have enough data, we see that the data overwhelms the prior. In this 
case, the MAP estimate converges towards the MLE. 

[Figure 3.3: three panels (prior, likelihood, posterior) for data = 16 8 2 64, with one row per hypothesis: even, odd, squares, mult of 3 through mult of 10, ends in 1 through ends in 9, powers of 2 through powers of 10, all, powers of 2 + {37}, powers of 2 − {32}.] 

Figure 3.3 Prior, likelihood and posterior for D = {16,8,2,64}. Based on (Tenenbaum 1999). Figure 
generated by numbersGame. 
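
The posterior of Equation 3.3 and the MAP estimate of Equation 3.6 can be sketched in a few lines. The hypothesis space and the prior weights below are illustrative assumptions (a handful of concepts, with "even" and "odd" favored and the "unnatural" concept given a small weight); they are not the values used to generate Figures 3.2 and 3.3.

```python
NUMS = range(1, 101)

# Illustrative hypotheses, each represented by its extension.
HYPOTHESES = {
    "even": {n for n in NUMS if n % 2 == 0},
    "odd": {n for n in NUMS if n % 2 == 1},
    "mult of 4": {n for n in NUMS if n % 4 == 0},
    "powers of 2": {2 ** k for k in range(1, 7)},
    "powers of 4": {4 ** k for k in range(1, 4)},
    "powers of 2 except 32": {2, 4, 8, 16, 64},
}

# Unnormalized prior weights (assumed values for illustration only).
PRIOR_WEIGHTS = {
    "even": 5.0, "odd": 5.0, "mult of 4": 1.0,
    "powers of 2": 1.0, "powers of 4": 1.0,
    "powers of 2 except 32": 0.01,
}

def posterior(data):
    """p(h | D) proportional to p(h) * I(D in h) / |h|^N, normalized over hypotheses."""
    unnorm = {}
    for name, ext in HYPOTHESES.items():
        consistent = all(x in ext for x in data)
        lik = (1.0 / len(ext)) ** len(data) if consistent else 0.0
        unnorm[name] = PRIOR_WEIGHTS[name] * lik
    z = sum(unnorm.values())
    return {name: w / z for name, w in unnorm.items()}

def map_hypothesis(data):
    """Return the hypothesis with the highest posterior probability."""
    post = posterior(data)
    return max(post, key=post.get)

print(map_hypothesis([16]), map_hypothesis([16, 8, 2, 64]))
```

Under these assumed weights, the low prior on "powers of 2 except 32" keeps it from winning after four examples, even though its likelihood is higher than that of "powers of 2".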

If the true hypothesis is in the hypothesis space, then the MAP/ML estimate will converge 
upon this hypothesis. Thus we say that Bayesian inference (and ML estimation) are consistent 
estimators (see Section 6.4.1 for details). We also say that the hypothesis space is identifiable in 
the limit, meaning we can recover the truth in the limit of infinite data. If our hypothesis class 
is not rich enough to represent the “truth” (which will usually be the case), we will converge 
on the hypothesis that is as close as possible to the truth. However, formalizing this notion of 
“closeness” is beyond the scope of this chapter. 

Figure 3.4 Posterior over hypotheses and the corresponding predictive distribution after seeing one 
example, D = {16}. A dot means this number is consistent with this hypothesis. The graph p(h|D) on 
the right is the weight given to hypothesis h. By taking a weighted sum of dots, we get $p(\tilde{x} \in C|D)$ (top). 
Based on Figure 2.9 of (Tenenbaum 1999). Figure generated by numbersGame. 


Posterior predictive distribution 


The posterior is our internal belief state about the world. The way to test if our beliefs are 
justified is to use them to predict objectively observable quantities (this is the basis of the 
scientific method). Specifically, the posterior predictive distribution in this context is given by 


$$p(\tilde{x} \in C|D) = \sum_h p(y = 1|\tilde{x}, h)\, p(h|D) \tag{3.8}$$


This is just a weighted average of the predictions of each individual hypothesis and is called 
Bayes model averaging (Hoeting et al. 1999). This is illustrated in Figure 3.4. The dots at the 
bottom show the predictions from each hypothesis; the vertical curve on the right shows the 
weight associated with each hypothesis. If we multiply each row by its weight and add up, we 
get the distribution at the top. 
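
The weighted sum in Equation 3.8 can be sketched as follows. This is a minimal illustration that assumes a uniform prior over a small, hand-picked hypothesis space; the names `HYPOTHESES`, `posterior` and `predictive` are hypothetical.

```python
NUMS = range(1, 101)

# Small illustrative hypothesis space with a uniform prior (an assumption).
HYPOTHESES = {
    "even": {n for n in NUMS if n % 2 == 0},
    "odd": {n for n in NUMS if n % 2 == 1},
    "mult of 4": {n for n in NUMS if n % 4 == 0},
    "ends in 6": {n for n in NUMS if n % 10 == 6},
    "powers of 2": {2 ** k for k in range(1, 7)},
    "powers of 4": {4 ** k for k in range(1, 4)},
}

def posterior(data):
    """p(h | D) under a uniform prior and the strong sampling likelihood."""
    unnorm = {
        name: (1.0 / len(ext)) ** len(data) if all(x in ext for x in data) else 0.0
        for name, ext in HYPOTHESES.items()
    }
    z = sum(unnorm.values())
    return {name: w / z for name, w in unnorm.items()}

def predictive(x_new, data):
    """p(x_new in C | D): sum the posterior weight of every hypothesis containing x_new."""
    post = posterior(data)
    return sum(p for name, p in post.items() if x_new in HYPOTHESES[name])

for x_new in (16, 64, 32, 6, 17):
    print(x_new, round(predictive(x_new, [16]), 3))
```

After seeing only D = {16}, numbers that belong to many consistent hypotheses (such as 64) receive more predictive mass than numbers that belong to few (such as 6), and numbers covered only by inconsistent hypotheses (such as 17) receive none.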

When we have a small and/or ambiguous dataset, the posterior p(h|D) is vague, which 
induces a broad predictive distribution. However, once we have “figured things out”, the posterior 
becomes a delta function centered at the MAP estimate. In this case, the predictive distribution 
becomes 

$$p(\tilde{x} \in C|D) = \sum_h p(\tilde{x}|h)\, \delta_{\hat{h}}(h) = p(\tilde{x}|\hat{h}) \tag{3.9}$$


This is called a plug-in approximation to the predictive density and is very widely used, due 
to its simplicity. However, in general, this under-represents our uncertainty, and our predictions 
will not be as “smooth” as when using BMA. We will see more examples of this later in the book. 

Although MAP learning is simple, it cannot explain the gradual shift from similarity-based 
reasoning (with uncertain posteriors) to rule-based reasoning (with certain posteriors). For 
example, suppose we observe D = {16}. If we use the simple prior above, the minimal 
consistent hypothesis is “all powers of 4”, so only 4, 16 and 64 get a non-zero probability of being 
predicted. This is of course an example of overfitting. Given D = {16,8,2,64}, the MAP 
hypothesis is “all powers of two”. Thus the plug-in predictive distribution gets broader (or stays 
the same) as we see more data: it starts narrow, but is forced to broaden as it sees more data. 
In contrast, in the Bayesian approach, we start broad and then narrow down as we learn more, 
which makes more intuitive sense. In particular, given D = {16}, there are many hypotheses 
with non-negligible posterior support, so the predictive distribution is broad. However, when we 
see D = {16,8, 2, 64}, the posterior concentrates its mass on one hypothesis, so the predictive 
distribution becomes narrower. So the predictions made by a plug-in approach and a Bayesian 
approach are quite different in the small sample regime, although they converge to the same 
answer as we see more data. 
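
The contrast between the plug-in approximation and Bayes model averaging can be made concrete with a small sketch. It assumes a uniform prior over four illustrative hypotheses, and it summarizes the BMA prediction by the set of numbers whose predictive probability is at least an arbitrary threshold of 0.05; both choices are assumptions made for this example.

```python
NUMS = range(1, 101)

HYPOTHESES = {
    "even": {n for n in NUMS if n % 2 == 0},
    "mult of 4": {n for n in NUMS if n % 4 == 0},
    "powers of 2": {2 ** k for k in range(1, 7)},
    "powers of 4": {4 ** k for k in range(1, 4)},
}

def posterior(data):
    """p(h | D) under a uniform prior and the strong sampling likelihood."""
    unnorm = {
        name: (1.0 / len(ext)) ** len(data) if all(x in ext for x in data) else 0.0
        for name, ext in HYPOTHESES.items()
    }
    z = sum(unnorm.values())
    return {name: w / z for name, w in unnorm.items()}

def plug_in_support(data):
    """Numbers predicted by the MAP hypothesis alone."""
    post = posterior(data)
    return HYPOTHESES[max(post, key=post.get)]

def bma_support(data, threshold=0.05):
    """Numbers whose model-averaged predictive probability reaches the threshold."""
    post = posterior(data)
    return {
        n for n in NUMS
        if sum(p for name, p in post.items() if n in HYPOTHESES[name]) >= threshold
    }

for data in ([16], [16, 8, 2, 64]):
    print(data, len(plug_in_support(data)), len(bma_support(data)))
```

With these assumptions the plug-in prediction grows from 3 numbers to 6 as data arrives, whereas the model-averaged prediction narrows from 26 numbers to the same 6.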


A more complex prior 


To model human behavior, Tenenbaum used a slightly more sophisticated prior which was de- 
rived by analysing some experimental data of how people measure similarity between numbers; 
see (Tenenbaum 1999, p208) for details. The result is a set of arithmetical concepts similar to 
those mentioned above, plus all intervals between n and m for $1 \le n, m \le 100$. (Note that 
these hypotheses are not mutually exclusive.) Thus the prior is a mixture of two priors, one 
over arithmetical rules, and one over intervals: 

$$p(h) = \pi_0\, p_{\text{rules}}(h) + (1 - \pi_0)\, p_{\text{interval}}(h) \tag{3.10}$$

The only free parameter in the model is the relative weight, $\pi_0$, given to these two parts of the 
prior. The results are not very sensitive to this value, so long as $\pi_0 > 0.5$, reflecting the fact 
that people are more likely to think of concepts defined by rules. The predictive distribution 
of the model, using this larger hypothesis space, is shown in Figure 3.5. It is strikingly similar 
to the human predictive distribution, shown in Figure 3.1, even though it was not fit to human 
data (modulo the choice of hypothesis space). 


The beta-binomial model 


The number game involved inferring a distribution over a discrete variable drawn from a finite 
hypothesis space, h € H, given a series of discrete observations. This made the computations 
particularly simple: we just needed to sum, multiply and divide. However, in many applications, 
the unknown parameters are continuous, so the hypothesis space is (some subset) of $\mathbb{R}^K$, where 
K is the number of parameters. This complicates the mathematics, since we have to replace 
sums with integrals. However, the basic ideas are the same. 

Figure 3.5 Predictive distributions for the model using the full hypothesis space. Compare to Figure 3.1. 
The predictions of the Bayesian model are only plotted for those values of $\tilde{x}$ for which human data is 
available; this is why the top line looks sparser than Figure 3.4. Source: Figure 5.6 of (Tenenbaum 1999). 
Used with kind permission of Josh Tenenbaum. 

We will illustrate this by considering the problem of inferring the probability that a coin shows 
up heads, given a series of observed coin tosses. Although this might seem trivial, it turns out 
that this model forms the basis of many of the methods we will consider later in this book, 
including naive Bayes classifiers, Markov models, etc. It is historically important, since it was the 
example which was analyzed in Bayes’ original paper of 1763. (Bayes’ analysis was subsequently 
generalized by Pierre-Simon Laplace, creating what we now call “Bayes rule” — see (Stigler 1986) 
for further historical details.) 

We will follow our now-familiar recipe of specifying the likelihood and prior, and deriving the 
posterior and posterior predictive. 


Likelihood 


Suppose $X_i \sim \text{Ber}(\theta)$, where $X_i = 1$ represents “heads”, $X_i = 0$ represents “tails”, and 
$\theta \in [0, 1]$ is the rate parameter (probability of heads). If the data are iid, the likelihood has the 
form 

$$p(D|\theta) = \theta^{N_1}(1 - \theta)^{N_0} \tag{3.11}$$

where we have $N_1 = \sum_{i=1}^{N} \mathbb{I}(x_i = 1)$ heads and $N_0 = \sum_{i=1}^{N} \mathbb{I}(x_i = 0)$ tails. These two counts 
are called the sufficient statistics of the data, since this is all we need to know about D to 
infer $\theta$. (An alternative set of sufficient statistics are $N_1$ and $N = N_0 + N_1$.) 

More formally, we say s(D) is a sufficient statistic for data D if $p(\theta|D) = p(\theta|s(D))$. If 
we use a uniform prior, this is equivalent to saying $p(D|\theta) \propto p(s(D)|\theta)$. Consequently, if we 
have two datasets with the same sufficient statistics, we will infer the same value for $\theta$. 
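
A short sketch illustrates why the counts are sufficient. It compares the likelihood of Equation 3.11, computed from the counts alone, with the product of per-toss Bernoulli probabilities for two different sequences that share the same counts. The function names and the example sequences are hypothetical.

```python
def sequence_likelihood(tosses, theta):
    """Product of Bernoulli probabilities, one factor per observed toss."""
    lik = 1.0
    for x in tosses:
        lik *= theta if x == 1 else 1.0 - theta
    return lik

def count_likelihood(n1, n0, theta):
    """Equation 3.11: the likelihood depends on the data only through N1 and N0."""
    return theta ** n1 * (1.0 - theta) ** n0

seq_a = [1, 1, 0, 1, 0]   # N1 = 3, N0 = 2
seq_b = [0, 1, 0, 1, 1]   # same counts, different order

for theta in (0.2, 0.5, 0.7):
    print(theta,
          sequence_likelihood(seq_a, theta),
          sequence_likelihood(seq_b, theta),
          count_likelihood(3, 2, theta))
```

For every value of theta the three numbers agree up to floating-point rounding, so the order of the tosses carries no information about the parameter.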

Now suppose the data consists of the count of the number of heads $N_1$ observed in a fixed 
number $N = N_1 + N_0$ of trials. In this case, we have $N_1 \sim \text{Bin}(N, \theta)$, where Bin represents 
the binomial distribution, which has the following pmf: 

$$\text{Bin}(k|n, \theta) \triangleq \binom{n}{k} \theta^k (1 - \theta)^{n-k} \tag{3.12}$$

Since $\binom{n}{k}$ is a constant independent of $\theta$, the likelihood for the binomial sampling model is the 
same as the likelihood for the Bernoulli model. So any inferences we make about $\theta$ will be the 
same whether we observe the counts, $D = (N_1, N)$, or a sequence of trials, $D = \{x_1, \ldots, x_N\}$. 


Prior 


We need a prior which has support over the interval [0,1]. To make the math easier, it would 
be convenient if the prior had the same form as the likelihood, i.e., if the prior looked like 

$$p(\theta) \propto \theta^{\gamma_1}(1 - \theta)^{\gamma_2} \tag{3.13}$$

for some prior parameters $\gamma_1$ and $\gamma_2$. If this were the case, then we could easily evaluate the 
posterior by simply adding up the exponents: 

$$p(\theta|D) \propto p(D|\theta)p(\theta) = \theta^{N_1}(1 - \theta)^{N_0}\, \theta^{\gamma_1}(1 - \theta)^{\gamma_2} = \theta^{N_1 + \gamma_1}(1 - \theta)^{N_0 + \gamma_2} \tag{3.14}$$


When the prior and the posterior have the same form, we say that the prior is a conjugate 
prior for the corresponding likelihood. Conjugate priors are widely used because they simplify 
computation, and are easy to interpret, as we see below. 

In the case of the Bernoulli, the conjugate prior is the beta distribution, which we encountered 
in Section 2.4.5: 


$$\text{Beta}(\theta|a, b) \propto \theta^{a-1}(1 - \theta)^{b-1} \tag{3.15}$$

The parameters of the prior are called hyper-parameters. We can set them in order to encode 
our prior beliefs. For example, to encode our beliefs that $\theta$ has mean 0.7 and standard deviation 
0.2, we set a = 2.975 and b = 1.275 (Exercise 3.15). Or to encode our beliefs that $\theta$ has mean 
0.15 and that we think it lives in the interval (0.05, 0.30) with high probability, we find a = 4.5 
and b = 25.5 (Exercise 3.16). 
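
The first of these settings can be reproduced by moment matching. The sketch below assumes the standard formulas for the mean, a/(a+b), and variance, ab/((a+b)^2 (a+b+1)), of a beta distribution, and solves them for a and b; the function name `beta_from_mean_std` is hypothetical.

```python
def beta_from_mean_std(mean, std):
    """Solve mean = a/(a+b) and var = mean*(1-mean)/(a+b+1) for (a, b)."""
    var = std ** 2
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if var >= mean * (1.0 - mean):
        raise ValueError("variance too large for a beta distribution with this mean")
    strength = mean * (1.0 - mean) / var - 1.0   # this is a + b
    return mean * strength, (1.0 - mean) * strength

a, b = beta_from_mean_std(0.7, 0.2)
print(a, b)
```

For a mean of 0.7 and a standard deviation of 0.2 this gives values matching a = 2.975 and b = 1.275 up to floating-point rounding. The interval-based specification in the second example needs a numerical search and is not covered by this sketch.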

If we know “nothing” about $\theta$, except that it lies in the interval [0,1], we can use a uniform 
prior, which is a kind of uninformative prior (see Section 5.4.2 for details). The uniform 
distribution can be represented by a beta distribution with a = b = 1. 

Figure 3.6 (a) Updating a Beta(2, 2) prior with a Binomial likelihood with sufficient statistics $N_1 = 3$, 
$N_0 = 17$ to yield a Beta(5, 19) posterior. (b) Updating a Beta(5, 2) prior with a Binomial likelihood 
with sufficient statistics $N_1 = 11$, $N_0 = 13$ to yield a Beta(16, 15) posterior. Figure generated by 
binomialBetaPosteriorDemo. 


Posterior 


If we multiply the likelihood by the beta prior we get the following posterior (following 
Equation 3.14): 

$$p(\theta|D) \propto \text{Bin}(N_1|N_0 + N_1, \theta)\, \text{Beta}(\theta|a, b) \propto \text{Beta}(\theta|N_1 + a, N_0 + b) \tag{3.16}$$

In particular, the posterior is obtained by adding the prior hyper-parameters to the empirical 
counts. For this reason, the hyper-parameters are known as pseudo counts. The strength of the 
prior, also known as the effective sample size of the prior, is the sum of the pseudo counts, 
a + b; this plays a role analogous to the data set size, $N_1 + N_0 = N$. 

Figure 3.6(a) gives an example where we update a weak Beta(2,2) prior with a peaked likelihood 
function, corresponding to a large sample size; we see that the posterior is essentially identical 
to the likelihood, since the data has overwhelmed the prior. Figure 3.6(b) gives an example 
where we update a strong Beta(5,2) prior with a peaked likelihood function; now we see that the 
posterior is a “compromise” between the prior and likelihood. 

Note that updating the posterior sequentially is equivalent to updating in a single batch.
To see this, suppose we have two data sets $\mathcal{D}_a$ and $\mathcal{D}_b$ with sufficient statistics $N_1^a, N_0^a$ and
$N_1^b, N_0^b$. Let $N_1 = N_1^a + N_1^b$ and $N_0 = N_0^a + N_0^b$ be the sufficient statistics of the combined
datasets. In batch mode we have


$$p(\theta|\mathcal{D}_a, \mathcal{D}_b) \propto \mathrm{Bin}(N_1|\theta, N_1+N_0)\,\mathrm{Beta}(\theta|a, b) \propto \mathrm{Beta}(\theta|N_1+a, N_0+b) \tag{3.17}$$


In sequential mode, we have 


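$$p(\theta|\mathcal{D}_a, \mathcal{D}_b) \propto p(\mathcal{D}_b|\theta)\,p(\theta|\mathcal{D}_a) \tag{3.18}$$
$$\propto \mathrm{Bin}(N_1^b|\theta, N_1^b+N_0^b)\,\mathrm{Beta}(\theta|N_1^a+a, N_0^a+b) \tag{3.19}$$
$$\propto \mathrm{Beta}(\theta|N_1^a+N_1^b+a, N_0^a+N_0^b+b) \tag{3.20}$$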


This makes Bayesian inference particularly well-suited to online learning, as we will see later. 
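
As a small illustration of this equivalence, the sketch below updates a Beta prior by adding counts, once in a single batch and once in two steps. The function name `beta_update` and the way the data is split into two batches are our own choices for the example.

```python
def beta_update(a, b, n_heads, n_tails):
    """Conjugate update: add the observed counts to the Beta pseudo counts."""
    return a + n_heads, b + n_tails


prior = (2, 2)

# Batch mode: all 3 heads and 17 tails at once.
batch = beta_update(*prior, 3, 17)

# Sequential mode: the same data split into two arbitrary batches.
after_first = beta_update(*prior, 1, 9)
sequential = beta_update(*after_first, 2, 8)

print(batch, sequential)  # (5, 19) (5, 19)
```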

Posterior mean and mode


From Equation 2.62, the MAP estimate is given by 


$$\hat{\theta}_{MAP} = \frac{a + N_1 - 1}{a + b + N - 2} \tag{3.21}$$

If we use a uniform prior, then the MAP estimate reduces to the MLE, which is just the empirical
fraction of heads:

$$\hat{\theta}_{MLE} = \frac{N_1}{N} \tag{3.22}$$


This makes intuitive sense, but it can also be derived by applying elementary calculus to
maximize the likelihood function in Equation 3.11 (Exercise 3.1).

By contrast, the posterior mean is given by

$$\bar{\theta} = \frac{a + N_1}{a + b + N} \tag{3.23}$$

This difference between the mode and the mean will prove important later. 

We will now show that the posterior mean is a convex combination of the prior mean and the
MLE, which captures the notion that the posterior is a compromise between what we previously
believed and what the data is telling us.

Let $\alpha_0 = a + b$ be the equivalent sample size of the prior, which controls its strength, and
let the prior mean be $m_1 = a/\alpha_0$. Then the posterior mean is given by

$$\mathbb{E}[\theta|\mathcal{D}] = \frac{\alpha_0 m_1 + N_1}{N + \alpha_0} = \frac{\alpha_0}{N + \alpha_0} m_1 + \frac{N}{N + \alpha_0}\frac{N_1}{N} = \lambda m_1 + (1 - \lambda)\hat{\theta}_{MLE} \tag{3.24}$$

where $\lambda = \frac{\alpha_0}{N + \alpha_0}$ is the ratio of the prior to posterior equivalent sample size. So the weaker the
prior, the smaller is $\lambda$, and hence the closer the posterior mean is to the MLE. One can show
similarly that the posterior mode is a convex combination of the prior mode and the MLE, and
that it too converges to the MLE.
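
The identity can be verified numerically. The following is a minimal sketch; the function names are ours, and the numbers reuse the Beta(2, 2) prior with 3 heads and 17 tails from Figure 3.6(a).

```python
def posterior_mean(a, b, n1, n0):
    """Mean of the Beta(a + n1, b + n0) posterior."""
    return (a + n1) / (a + b + n1 + n0)


def posterior_mean_as_mixture(a, b, n1, n0):
    """The same quantity written as lambda * prior_mean + (1 - lambda) * MLE."""
    n = n1 + n0
    alpha0 = a + b                 # equivalent sample size of the prior
    lam = alpha0 / (n + alpha0)    # weight given to the prior mean
    prior_mean = a / alpha0
    mle = n1 / n
    return lam * prior_mean + (1 - lam) * mle


print(posterior_mean(2, 2, 3, 17))
print(posterior_mean_as_mixture(2, 2, 3, 17))
```

Both calls evaluate 5/24, up to floating-point rounding.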


Posterior variance 


The mean and mode are point estimates, but it is useful to know how much we can trust them.
The variance of the posterior is one way to measure this. The variance of the Beta posterior is
given by

$$\mathrm{var}[\theta|\mathcal{D}] = \frac{(a + N_1)(b + N_0)}{(a + N_1 + b + N_0)^2 (a + N_1 + b + N_0 + 1)} \tag{3.25}$$

We can simplify this formidable expression in the case that $N \gg a, b$, to get

$$\mathrm{var}[\theta|\mathcal{D}] \approx \frac{N_1 N_0}{N N N} = \frac{\hat{\theta}(1 - \hat{\theta})}{N} \tag{3.26}$$

where $\hat{\theta}$ is the MLE. Hence the “error bar” in our estimate (i.e., the posterior standard deviation)
is given by

$$\sigma = \sqrt{\mathrm{var}[\theta|\mathcal{D}]} \approx \sqrt{\frac{\hat{\theta}(1 - \hat{\theta})}{N}} \tag{3.27}$$

We see that the uncertainty goes down at a rate of $1/\sqrt{N}$. Note, however, that the uncertainty
(variance) is maximized when $\hat{\theta} = 0.5$, and is minimized when $\hat{\theta}$ is close to 0 or 1. This means
it is easier to be sure that a coin is biased than to be sure that it is fair.
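
To get a feel for how good the large-sample approximation is, we can compare it with the exact posterior standard deviation. This is only a sketch with hypothetical function names and made-up counts (300 heads, 700 tails, uniform prior).

```python
import math


def posterior_std(a, b, n1, n0):
    """Exact standard deviation of the Beta(a + n1, b + n0) posterior."""
    a_post, b_post = a + n1, b + n0
    total = a_post + b_post
    return math.sqrt(a_post * b_post / (total ** 2 * (total + 1)))


def approx_std(n1, n0):
    """Large-sample approximation sqrt(mle * (1 - mle) / N)."""
    n = n1 + n0
    mle = n1 / n
    return math.sqrt(mle * (1 - mle) / n)


print(posterior_std(1, 1, 300, 700), approx_std(300, 700))
# The error bar is widest for a fair-looking coin and narrower for a biased one.
print(approx_std(500, 500), approx_std(100, 900))
```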


Posterior predictive distribution 


So far, we have been focusing on inference of the unknown parameter(s). Let us now turn our 
attention to prediction of future observable data. 

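Consider predicting the probability of heads in a single future trial under a Beta(a, b)
posterior. We have

$$p(\tilde{x} = 1|\mathcal{D}) = \int_0^1 p(x = 1|\theta)\,p(\theta|\mathcal{D})\,d\theta \tag{3.28}$$
$$= \int_0^1 \theta\,\mathrm{Beta}(\theta|a, b)\,d\theta = \mathbb{E}[\theta|\mathcal{D}] = \frac{a}{a + b} \tag{3.29}$$

Thus we see that the mean of the posterior predictive distribution is equivalent (in this case) to
plugging in the posterior mean parameters: $p(\tilde{x}|\mathcal{D}) = \mathrm{Ber}(\tilde{x}|\mathbb{E}[\theta|\mathcal{D}])$.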

Overfitting and the black swan paradox 


Suppose instead that we plug in the MLE, i.e., we use $p(\tilde{x}|\mathcal{D}) \approx \mathrm{Ber}(\tilde{x}|\hat{\theta}_{MLE})$. Unfortunately,
this approximation can perform quite poorly when the sample size is small. For example,
suppose we have seen N = 3 tails in a row. The MLE is $\hat{\theta} = 0/3 = 0$, since this makes the
observed data as probable as possible. However, using this estimate, we predict that heads are
impossible. This is called the zero count problem or the sparse data problem, and frequently
occurs when estimating counts from small amounts of data. One might think that in the era
of “big data”, such concerns are irrelevant, but note that once we partition the data based on
certain criteria (such as the number of times a specific person has engaged in a specific activity)
the sample sizes can become much smaller. This problem arises, for example, when trying
to perform personalized recommendation of web pages. Thus Bayesian methods are still useful,
even in the big data regime (Jordan 2011).

The zero-count problem is analogous to a problem in philosophy called the black swan 
paradox. This is based on the ancient Western conception that all swans were white. In 
that context, a black swan was a metaphor for something that could not exist. (Black swans 
were discovered in Australia by European explorers in the 17th Century.) The term “black swan 
paradox” was first coined by the famous philosopher of science Karl Popper; the term has also 
been used as the title of a recent popular book (Taleb 2007). This paradox was used to illustrate 
the problem of induction, which is the problem of how to draw general conclusions about the 
future from specific observations from the past. 

Let us now derive a simple Bayesian solution to the problem. We will use a uniform prior, so
a = b = 1. In this case, plugging in the posterior mean gives Laplace’s rule of succession

$$p(\tilde{x} = 1|\mathcal{D}) = \frac{N_1 + 1}{N_1 + N_0 + 2} \tag{3.30}$$

This justifies the common practice of adding 1 to the empirical counts, normalizing and then
plugging them in, a technique known as add-one smoothing. (Note that plugging in the MAP
parameters would not have this smoothing effect, since the mode has the form
$\hat{\theta} = \frac{N_1 + a - 1}{N + a + b - 2}$, which becomes the MLE if a = b = 1.)
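
The contrast between the plug-in MLE and Laplace’s rule is easy to see on the three-tails example. The two helper functions below are illustrative names of ours, not part of any library.

```python
def plug_in_mle(n_heads, n_tails):
    """Probability of heads predicted by plugging in the MLE."""
    return n_heads / (n_heads + n_tails)


def laplace_rule(n_heads, n_tails):
    """Probability of heads under a uniform Beta(1, 1) prior (add-one smoothing)."""
    return (n_heads + 1) / (n_heads + n_tails + 2)


# After seeing N = 3 tails in a row:
print(plug_in_mle(0, 3))   # 0.0, so heads are declared impossible
print(laplace_rule(0, 3))  # 0.2, so heads remain possible
```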


Predicting the outcome of multiple future trials 


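Suppose now we were interested in predicting the number of heads, x, in M future trials. This
is given by

$$p(x|\mathcal{D}, M) = \int_0^1 \mathrm{Bin}(x|\theta, M)\,\mathrm{Beta}(\theta|a, b)\,d\theta \tag{3.31}$$
$$= \binom{M}{x}\frac{1}{B(a, b)}\int_0^1 \theta^x (1 - \theta)^{M - x}\,\theta^{a - 1}(1 - \theta)^{b - 1}\,d\theta \tag{3.32}$$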


We recognize the integral as the normalization constant for a Beta(a + x, M − x + b) distribution.
Hence

$$\int_0^1 \theta^x (1 - \theta)^{M - x}\,\theta^{a - 1}(1 - \theta)^{b - 1}\,d\theta = B(x + a, M - x + b) \tag{3.33}$$


Thus we find that the posterior predictive is given by the following, known as the (compound)
beta-binomial distribution:

$$\mathrm{Bb}(x|a, b, M) \triangleq \binom{M}{x}\frac{B(x + a, M - x + b)}{B(a, b)} \tag{3.34}$$

This distribution has the following mean and variance

$$\mathbb{E}[x] = M\frac{a}{a + b}, \quad \mathrm{var}[x] = \frac{M a b}{(a + b)^2}\,\frac{(a + b + M)}{a + b + 1} \tag{3.35}$$

If M = 1, and hence $x \in \{0, 1\}$, we see that the mean becomes $\mathbb{E}[x|\mathcal{D}] = p(x = 1|\mathcal{D}) = \frac{a}{a + b}$,
which is consistent with Equation 3.29.

This process is illustrated in Figure 3.7(a). We start with a Beta(2,2) prior, and plot the
posterior predictive density after seeing $N_1 = 3$ heads and $N_0 = 17$ tails. Figure 3.7(b) plots
a plug-in approximation using a MAP estimate. We see that the Bayesian prediction has longer
tails, spreading its probability mass more widely, and is therefore less prone to overfitting and
black-swan-type paradoxes.
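
The beta-binomial pmf is straightforward to evaluate with log-gamma functions. The sketch below uses only the Python standard library; the function names are ours, and the posterior Beta(5, 19) is the one obtained from the Beta(2, 2) prior and the counts above.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(x, a, b, m):
    """Probability of x heads in m future trials under a Beta(a, b) posterior."""
    return math.comb(m, x) * math.exp(log_beta(x + a, m - x + b) - log_beta(a, b))


a, b, m = 5, 19, 10
pmf = [beta_binomial_pmf(x, a, b, m) for x in range(m + 1)]
mean = sum(x * p for x, p in enumerate(pmf))
print(sum(pmf))                 # close to 1
print(mean, m * a / (a + b))    # both close to 10 * 5 / 24
```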


The Dirichlet-multinomial model 


In the previous section, we discussed how to infer the probability that a coin comes up heads. 
In this section, we generalize these results to infer the probability that a dice with K sides 
comes up as face k. This might seem like another toy exercise, but the methods we will study 
are widely used to analyse text data, biosequence data, etc., as we will see later. 

Figure 3.7 (a) Posterior predictive distributions after seeing $N_1 = 3$, $N_0 = 17$. (b) Plug-in approximation.
Figure generated by betaBinomPostPredDemo.


3.4.1 Likelihood 


Suppose we observe N dice rolls, $\mathcal{D} = \{x_1, \ldots, x_N\}$, where $x_i \in \{1, \ldots, K\}$. If we assume
the data is iid, the likelihood has the form

$$p(\mathcal{D}|\theta) = \prod_{k=1}^K \theta_k^{N_k} \tag{3.36}$$

where $N_k = \sum_{i=1}^N \mathbb{I}(x_i = k)$ is the number of times event k occurred (these are the sufficient
statistics for this model). The likelihood for the multinomial model has the same form, up to an
irrelevant constant factor.


3.4.2 Prior 


Since the parameter vector lives in the K-dimensional probability simplex, we need a prior that
has support over this simplex. Ideally it would also be conjugate. Fortunately, the Dirichlet
distribution (Section 2.5.4) satisfies both criteria. So we will use the following prior:

$$\mathrm{Dir}(\theta|\alpha) = \frac{1}{B(\alpha)}\prod_{k=1}^K \theta_k^{\alpha_k - 1}\,\mathbb{I}(\theta \in S_K) \tag{3.37}$$

3.4.3 Posterior 


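Multiplying the likelihood by the prior, we find that the posterior is also Dirichlet:

$$p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\,p(\theta) \tag{3.38}$$
$$\propto \prod_{k=1}^K \theta_k^{N_k}\theta_k^{\alpha_k - 1} = \prod_{k=1}^K \theta_k^{\alpha_k + N_k - 1} \tag{3.39}$$
$$= \mathrm{Dir}(\theta|\alpha_1 + N_1, \ldots, \alpha_K + N_K) \tag{3.40}$$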




We see that the posterior is obtained by adding the prior hyper-parameters (pseudo-counts) $\alpha_k$
to the empirical counts $N_k$.

We can derive the mode of this posterior (i.e., the MAP estimate) by using calculus. However,
we must enforce the constraint that $\sum_k \theta_k = 1$ (see footnote 2). We can do this by using a Lagrange
multiplier. The constrained objective function, or Lagrangian, is given by the log likelihood
plus log prior plus the constraint:

$$\ell(\theta, \lambda) = \sum_k N_k \log\theta_k + \sum_k (\alpha_k - 1)\log\theta_k + \lambda\Big(1 - \sum_k \theta_k\Big) \tag{3.41}$$
k k k 


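To simplify notation, we define $N_k' \triangleq N_k + \alpha_k - 1$. Taking derivatives with respect to $\lambda$ yields
the original constraint:

$$\frac{\partial\ell}{\partial\lambda} = \Big(1 - \sum_k \theta_k\Big) = 0 \tag{3.42}$$

Taking derivatives with respect to $\theta_k$ yields

$$\frac{\partial\ell}{\partial\theta_k} = \frac{N_k'}{\theta_k} - \lambda = 0 \tag{3.43}$$
$$N_k' = \lambda\theta_k \tag{3.44}$$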


We can solve for $\lambda$ using the sum-to-one constraint:

$$\sum_k N_k' = \lambda\sum_k \theta_k \tag{3.45}$$
$$N + \alpha_0 - K = \lambda \tag{3.46}$$

where $\alpha_0 \triangleq \sum_{k=1}^K \alpha_k$ is the equivalent sample size of the prior. Thus the MAP estimate is
given by

$$\hat{\theta}_k = \frac{N_k + \alpha_k - 1}{N + \alpha_0 - K} \tag{3.47}$$

which is consistent with Equation 2.77. If we use a uniform prior, $\alpha_k = 1$, we recover the MLE:

$$\hat{\theta}_k = N_k / N \tag{3.48}$$

This is just the empirical fraction of times face k shows up.

2. We do not need to explicitly enforce the constraint that $\theta_k \geq 0$ since the gradient of the objective has the form
$N_k'/\theta_k - \lambda$; so negative values would reduce the objective, rather than maximize it. (Of course, this does not preclude
setting $\theta_k = 0$, and indeed this is the optimal solution if $N_k = 0$ and $\alpha_k = 1$.)
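
The MAP formula is simple enough to check by hand, but a short sketch makes the role of the pseudo counts concrete. The function name `dirichlet_map` and the example counts are ours; the formula assumes every $N_k + \alpha_k \geq 1$, so that the mode lies inside (or on the boundary of) the simplex.

```python
def dirichlet_map(counts, alpha):
    """Mode of the Dir(alpha + counts) posterior, assuming each count + alpha >= 1."""
    k = len(counts)
    denom = sum(counts) + sum(alpha) - k
    return [(n + a - 1) / denom for n, a in zip(counts, alpha)]


counts = [2, 0, 1]
print(dirichlet_map(counts, [1, 1, 1]))  # uniform prior: the MLE, N_k / N
print(dirichlet_map(counts, [2, 2, 2]))  # stronger prior pulls estimates toward 1/3
```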

Posterior predictive


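The posterior predictive distribution for a single multinoulli trial is given by the following
expression:

$$p(X = j|\mathcal{D}) = \int p(X = j|\theta)\,p(\theta|\mathcal{D})\,d\theta \tag{3.49}$$
$$= \int p(X = j|\theta_j)\left[\int p(\theta_{-j}, \theta_j|\mathcal{D})\,d\theta_{-j}\right] d\theta_j \tag{3.50}$$
$$= \int \theta_j\,p(\theta_j|\mathcal{D})\,d\theta_j = \mathbb{E}[\theta_j|\mathcal{D}] = \frac{\alpha_j + N_j}{\sum_k (\alpha_k + N_k)} = \frac{\alpha_j + N_j}{\alpha_0 + N} \tag{3.51}$$

where $\theta_{-j}$ are all the components of $\theta$ except $\theta_j$. See also Exercise 3.13.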

The above expression avoids the zero-count problem, just as we saw in Section 3.3.4.1. In 
fact, this form of Bayesian smoothing is even more important in the multinomial case than the 
binary case, since the likelihood of data sparsity increases once we start partitioning the data 
into many categories. 


Worked example: language models using bag of words 


One application of Bayesian smoothing using the Dirichlet-multinomial model is to language
modeling, which means predicting which words might occur next in a sequence. Here we
will take a very simple-minded approach, and assume that the ith word, $X_i \in \{1, \ldots, K\}$, is
sampled independently from all the other words using a $\mathrm{Cat}(\theta)$ distribution. This is called the
bag of words model. Given a past sequence of words, how can we predict which one is likely
to come next?

For example, suppose we observe the following sequence (part of a children’s nursery rhyme): 


Mary had a little lamb, little lamb, little lamb, 
Mary had a little lamb, its fleece as white as snow 


Furthermore, suppose our vocabulary consists of the following words: 


mary lamb little big fleece white black snow rain unk 
1 2 3 4 5 6 7 8 9 10 


Here unk stands for unknown, and represents all other words that do not appear elsewhere on 
the list. To encode each line of the nursery rhyme, we first strip off punctuation, and remove 
any stop words such as “a”, “as”, “the”, etc. We can also perform stemming, which means 
reducing words to their base form, such as stripping off the final s in plural words, or the ing 
from verbs (e.g., running becomes run). In this example, no words need stemming. Finally, we 


replace each word by its index into the vocabulary to get: 


We now ignore the word order, and count how often each word occurred, resulting in a 
histogram of word counts: 

Token | 1    2    3      4   5      6     7     8    9    10
Word  | mary lamb little big fleece white black snow rain unk
Count | 2    4    4      0   1      1     0     1    0    4


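Denote the above counts by $N_j$. If we use a $\mathrm{Dir}(\alpha)$ prior for $\theta$, the posterior predictive is
just

$$p(\tilde{X} = j|\mathcal{D}) = \mathbb{E}[\theta_j|\mathcal{D}] = \frac{\alpha_j + N_j}{\sum_{j'} (\alpha_{j'} + N_{j'})} = \frac{1 + N_j}{10 + 17} \tag{3.52}$$

where the last equality assumes $\alpha_j = 1$. With this setting, we get

$$p(\tilde{X} = j|\mathcal{D}) = (3/27, 5/27, 5/27, 1/27, 2/27, 2/27, 1/27, 2/27, 1/27, 5/27) \tag{3.53}$$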


The modes of the predictive distribution are X = 2 (“lamb”) and X = 10 (“unk”). Note that the 
words “big”, “black” and “rain” are predicted to occur with non-zero probability in the future, 
even though they have never been seen before. Later on we will see more sophisticated language 
models. 
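
The predictive probabilities in this example can be reproduced from the word counts in the table. The sketch below uses exact fractions; the variable and function names are ours, and the counts are copied from the histogram above.

```python
from fractions import Fraction

VOCAB = ["mary", "lamb", "little", "big", "fleece", "white", "black", "snow", "rain", "unk"]
COUNTS = [2, 4, 4, 0, 1, 1, 0, 1, 0, 4]


def posterior_predictive(counts, alpha):
    """Posterior mean of each word probability under a Dir(alpha) prior."""
    total = sum(counts) + sum(alpha)
    return [Fraction(n + a, total) for n, a in zip(counts, alpha)]


pred = posterior_predictive(COUNTS, [1] * len(COUNTS))
for word, p in zip(VOCAB, pred):
    print(f"{word:>7}: {float(p):.3f}")
```

Note that `Fraction` reduces each value to lowest terms, so 3/27 is stored as 1/9.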


Naive Bayes classifiers 


In this section, we discuss how to classify vectors of discrete-valued features, $x \in \{1, \ldots, K\}^D$,
where K is the number of values for each feature, and D is the number of features. We will use
a generative approach. This requires us to specify the class conditional distribution, p(x|y = c).
The simplest approach is to assume the features are conditionally independent given the class
label. This allows us to write the class conditional density as a product of one dimensional
densities:

$$p(x|y = c, \theta) = \prod_{j=1}^D p(x_j|y = c, \theta_{jc}) \tag{3.54}$$

The resulting model is called a naive Bayes classifier (NBC). 

The model is called “naive” since we do not expect the features to be independent, even 
conditional on the class label. However, even if the naive Bayes assumption is not true, it often 
results in classifiers that work well (Domingos and Pazzani 1997). One reason for this is that the 
model is quite simple (it only has O(C D) parameters, for C classes and D features), and hence 
it is relatively immune to overfitting. 

The form of the class-conditional density depends on the type of each feature. We give some 
possibilities below: 


• In the case of real-valued features, we can use the Gaussian distribution:
$p(x|y = c, \theta) = \prod_{j=1}^D \mathcal{N}(x_j|\mu_{jc}, \sigma_{jc}^2)$, where $\mu_{jc}$ is the mean of feature j in objects of class c,
and $\sigma_{jc}^2$ is its variance.

• In the case of binary features, $x_j \in \{0, 1\}$, we can use the Bernoulli distribution:
$p(x|y = c, \theta) = \prod_{j=1}^D \mathrm{Ber}(x_j|\mu_{jc})$, where $\mu_{jc}$ is the probability that feature j occurs in class c.
This is sometimes called the multivariate Bernoulli naive Bayes model. We will see an
application of this below.

• In the case of categorical features, $x_j \in \{1, \ldots, K\}$, we can use the multinoulli
distribution: $p(x|y = c, \theta) = \prod_{j=1}^D \mathrm{Cat}(x_j|\mu_{jc})$, where $\mu_{jc}$ is a histogram over the K
possible values for $x_j$ in class c.


Obviously we can handle other kinds of features, or use different distributional assumptions. 
Also, it is easy to mix and match features of different types. 


Model fitting 


We now discuss how to “train” a naive Bayes classifier. This usually means computing the MLE
or the MAP estimate for the parameters. However, we will also discuss how to compute the full
posterior, $p(\theta|\mathcal{D})$.


MLE for NBC 


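The probability for a single data case is given by

$$p(x_i, y_i|\theta) = p(y_i|\pi)\prod_j p(x_{ij}|\theta_j) = \prod_c \pi_c^{\mathbb{I}(y_i = c)}\prod_j\prod_c p(x_{ij}|\theta_{jc})^{\mathbb{I}(y_i = c)} \tag{3.55}$$

Hence the log-likelihood is given by

$$\log p(\mathcal{D}|\theta) = \sum_{c=1}^C N_c \log\pi_c + \sum_{j=1}^D\sum_{c=1}^C\sum_{i: y_i = c} \log p(x_{ij}|\theta_{jc}) \tag{3.56}$$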


We see that this expression decomposes into a series of terms, one concerning $\pi$, and DC
terms containing the $\theta_{jc}$’s. Hence we can optimize all these parameters separately.
From Equation 3.48, the MLE for the class prior is given by

$$\hat{\pi}_c = \frac{N_c}{N} \tag{3.57}$$

where $N_c \triangleq \sum_i \mathbb{I}(y_i = c)$ is the number of examples in class c.

The MLE for the likelihood depends on the type of distribution we choose to use for each
feature. For simplicity, let us suppose all features are binary, so $x_j|y = c \sim \mathrm{Ber}(\theta_{jc})$. In this
case, the MLE becomes

$$\hat{\theta}_{jc} = \frac{N_{jc}}{N_c} \tag{3.58}$$


It is extremely simple to implement this model fitting procedure: see Algorithm 3.1 for some
pseudo-code (and naiveBayesFit for some Matlab code). This algorithm takes
O(ND) time. The method is easily generalized to handle features of mixed type. This simplicity
is one reason the method is so widely used.
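
For readers who prefer running code to pseudo-code, here is a minimal pure-Python sketch of the counting procedure for binary features. It is not the naiveBayesFit routine mentioned above; the function name `fit_naive_bayes`, the 0-based class labels, and the toy data are assumptions made for this illustration. It also assumes every class appears at least once in the training set.

```python
def fit_naive_bayes(X, y, num_classes):
    """MLE for a binary-feature naive Bayes model.

    X is a list of N binary feature vectors, y a list of N labels in 0..num_classes-1.
    Returns (pi, theta) with pi[c] = N_c / N and theta[c][j] = N_jc / N_c.
    """
    n, d = len(X), len(X[0])
    class_counts = [0] * num_classes
    feature_counts = [[0] * d for _ in range(num_classes)]
    for features, c in zip(X, y):
        class_counts[c] += 1
        for j, value in enumerate(features):
            if value == 1:
                feature_counts[c][j] += 1
    pi = [nc / n for nc in class_counts]
    theta = [[feature_counts[c][j] / class_counts[c] for j in range(d)]
             for c in range(num_classes)]
    return pi, theta


X = [[1, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 0, 1, 1]
pi, theta = fit_naive_bayes(X, y, num_classes=2)
print(pi)     # [0.5, 0.5]
print(theta)  # [[1.0, 0.5], [0.5, 0.5]]
```

The two nested loops visit each of the N × D entries once, which is where the O(ND) cost comes from.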

Figure 3.8 gives an example where we have 2 classes and 600 binary features, representing the
presence or absence of words in a bag-of-words model. The plot visualizes the $\theta_c$ vectors for the
two classes. The big spike at index 107 corresponds to the word “subject”, which occurs in both
classes with probability 1. (In Section 3.5.4, we discuss how to “filter out” such uninformative
features.)

Algorithm 3.1: Fitting a naive Bayes classifier to binary features
    N_c = 0, N_jc = 0;
    for i = 1 : N do
        c = y_i    // class label of i'th example
        N_c := N_c + 1;
        for j = 1 : D do
            if x_ij = 1 then
                N_jc := N_jc + 1
    \hat{\pi}_c = N_c / N,  \hat{\theta}_jc = N_jc / N_c



Figure 3.8 Class conditional densities $p(x_j = 1|y = c)$ for two document classes, corresponding to “X
windows” and “MS windows”. Figure generated by naiveBayesBowDemo.


Bayesian naive Bayes 


The trouble with maximum likelihood is that it can overfit. For example, in Figure 3.8
the feature corresponding to the word “subject” (call it feature j) always occurs
in both classes, so we estimate $\hat{\theta}_{jc} = 1$. What will happen if we encounter a new email which
does not have this word in it? Our algorithm will crash and burn, since we will find that
$p(y = c|x, \hat{\theta}) = 0$ for both classes! This is another manifestation of the black swan paradox
discussed in Section 3.3.4.1.

A simple solution to overfitting is to be Bayesian. For simplicity, we will use a factored prior:

$$p(\theta) = p(\pi)\prod_{j=1}^D\prod_{c=1}^C p(\theta_{jc}) \tag{3.59}$$

We will use a $\mathrm{Dir}(\alpha)$ prior for $\pi$ and a $\mathrm{Beta}(\beta_0, \beta_1)$ prior for each $\theta_{jc}$. Often we just take
$\alpha = 1$ and $\beta = 1$, corresponding to add-one or Laplace smoothing.

Combining the factored likelihood in Equation 3.56 with the factored prior above gives the
following factored posterior:

$$p(\theta|\mathcal{D}) = p(\pi|\mathcal{D})\prod_{j=1}^D\prod_{c=1}^C p(\theta_{jc}|\mathcal{D}) \tag{3.60}$$
$$p(\pi|\mathcal{D}) = \mathrm{Dir}(N_1 + \alpha_1, \ldots, N_C + \alpha_C) \tag{3.61}$$
$$p(\theta_{jc}|\mathcal{D}) = \mathrm{Beta}((N_c - N_{jc}) + \beta_0, N_{jc} + \beta_1) \tag{3.62}$$


In other words, to compute the posterior, we just update the prior counts with the empirical
counts from the likelihood. It is straightforward to modify Algorithm 3.1 to handle this version of
model “fitting”.
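
One way to make this modification concrete is to add the pseudo counts to the empirical counts before normalizing, which yields the posterior means of the parameters. The sketch below is an illustration only: the function name, the use of a single symmetric `alpha` for all classes, the 0-based labels, and the toy data are our assumptions.

```python
def fit_naive_bayes_smoothed(X, y, num_classes, alpha=1.0, beta0=1.0, beta1=1.0):
    """Posterior-mean estimates for a binary-feature naive Bayes model.

    Uses a symmetric Dir(alpha) prior on the class prior and a Beta(beta0, beta1)
    prior on each theta[c][j], where beta1 is the pseudo count for the feature being on.
    """
    n, d = len(X), len(X[0])
    class_counts = [0] * num_classes
    feature_counts = [[0] * d for _ in range(num_classes)]
    for features, c in zip(X, y):
        class_counts[c] += 1
        for j, value in enumerate(features):
            if value == 1:
                feature_counts[c][j] += 1
    pi = [(class_counts[c] + alpha) / (n + num_classes * alpha)
          for c in range(num_classes)]
    theta = [[(feature_counts[c][j] + beta1) / (class_counts[c] + beta0 + beta1)
              for j in range(d)] for c in range(num_classes)]
    return pi, theta


X = [[1, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 0, 1, 1]
pi, theta = fit_naive_bayes_smoothed(X, y, num_classes=2)
print(theta[0][0])  # 0.75 rather than the MLE of 1.0
```

With the default pseudo counts, a feature that is always on in the training data no longer receives probability exactly 1, so a test document lacking it is not assigned zero probability.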


3.5.2 Using the model for prediction 


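At test time, the goal is to compute

$$p(y = c|x, \mathcal{D}) \propto p(y = c|\mathcal{D})\prod_{j=1}^D p(x_j|y = c, \mathcal{D}) \tag{3.63}$$

The correct Bayesian procedure is to integrate out the unknown parameters:

$$p(y = c|x, \mathcal{D}) \propto \left[\int \mathrm{Cat}(y = c|\pi)\,p(\pi|\mathcal{D})\,d\pi\right] \tag{3.64}$$
$$\prod_{j=1}^D \left[\int \mathrm{Ber}(x_j|y = c, \theta_{jc})\,p(\theta_{jc}|\mathcal{D})\,d\theta_{jc}\right] \tag{3.65}$$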


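Fortunately, this is easy to do, at least if the posterior is Dirichlet. In particular, from
Equation 3.51, we know the posterior predictive density can be obtained by simply plugging in the
posterior mean parameters $\bar{\theta}$. Hence

$$p(y = c|x, \mathcal{D}) \propto \bar{\pi}_c \prod_{j=1}^D (\bar{\theta}_{jc})^{\mathbb{I}(x_j = 1)}(1 - \bar{\theta}_{jc})^{\mathbb{I}(x_j = 0)} \tag{3.66}$$
$$\bar{\theta}_{jc} = \frac{N_{jc} + \beta_1}{N_c + \beta_0 + \beta_1} \tag{3.67}$$
$$\bar{\pi}_c = \frac{N_c + \alpha_c}{N + \alpha_0} \tag{3.68}$$

where $\alpha_0 = \sum_c \alpha_c$.

If we have approximated the posterior by a single point, $p(\theta|\mathcal{D}) \approx \delta_{\hat{\theta}}(\theta)$, where $\hat{\theta}$ may be
the ML or MAP estimate, then the posterior predictive density is obtained by simply plugging in
the parameters, to yield a virtually identical rule:

$$p(y = c|x, \mathcal{D}) \propto \hat{\pi}_c \prod_{j=1}^D (\hat{\theta}_{jc})^{\mathbb{I}(x_j = 1)}(1 - \hat{\theta}_{jc})^{\mathbb{I}(x_j = 0)} \tag{3.69}$$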




The only difference is we replaced the posterior mean $\bar{\theta}$ with the posterior mode or MLE $\hat{\theta}$.
However, this small difference can be important in practice, since the posterior mean will result
in less overfitting (see Section 3.4.4.1).


3.5.3 The log-sum-exp trick


We now discuss one important practical detail that arises when using generative classifiers of any
kind. We can compute the posterior over class labels using Equation 2.13, using the appropriate
class-conditional density (and a plug-in approximation). Unfortunately a naive implementation
of Equation 2.13 can fail due to numerical underflow. The problem is that p(x|y = c) is often
a very small number, especially if x is a high-dimensional vector. This is because we require
that $\sum_x p(x|y) = 1$, so the probability of observing any particular high-dimensional vector is
small. The obvious solution is to take logs when applying Bayes rule, as follows:


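$$\log p(y = c|x) = b_c - \log\left[\sum_{c'=1}^C e^{b_{c'}}\right] \tag{3.70}$$
$$b_c \triangleq \log p(x|y = c) + \log p(y = c) \tag{3.71}$$

However, this requires evaluating the following expression

$$\log\left[\sum_{c'} e^{b_{c'}}\right] = \log\sum_{c'} p(y = c', x) = \log p(x) \tag{3.72}$$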


and we can’t add up in the log domain. Fortunately, we can factor out the largest term, and just
represent the remaining numbers relative to that. For example,

$$\log(e^{-120} + e^{-121}) = \log\left(e^{-120}(e^{0} + e^{-1})\right) = \log(e^{0} + e^{-1}) - 120 \tag{3.73}$$

In general, we have

$$\log\sum_c e^{b_c} = \log\left[\Big(\sum_c e^{b_c - B}\Big)e^{B}\right] = \left[\log\Big(\sum_c e^{b_c - B}\Big)\right] + B \tag{3.74}$$

where $B = \max_c b_c$. This is called the log-sum-exp trick, and is widely used. (See the function
logsumexp for an implementation.)
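
A minimal Python version of the trick is sketched below. It is not the logsumexp function referred to above, just an illustration using the standard library; the inputs are chosen so that the naive computation would underflow in double precision.

```python
import math


def logsumexp(b):
    """Compute log(sum(exp(b_c))) stably by factoring out the largest term."""
    big = max(b)
    return big + math.log(sum(math.exp(v - big) for v in b))


print(logsumexp([-120.0, -121.0]))    # log(1 + e^-1) - 120

# With more extreme values the naive approach fails: math.exp(-1200.0) underflows
# to 0.0, so taking the log of the naive sum would raise an error.
print(math.exp(-1200.0) + math.exp(-1201.0))  # 0.0
print(logsumexp([-1200.0, -1201.0]))          # log(1 + e^-1) - 1200
```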

This trick is used in Algorithm 3.2, which gives pseudo-code for using an NBC to compute
$p(y_i|x_i, \hat{\theta})$. See naiveBayesPredict for the Matlab code. Note that we do not need the
log-sum-exp trick if we only want to compute $\hat{y}_i$, since we can just maximize the unnormalized
quantity $\log p(y_i = c) + \log p(x_i|y = c)$.
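
The prediction step for a single example can be sketched in a few lines of Python. This is not the naiveBayesPredict code; the function name, the 0-based class indices, and the toy parameters are assumptions for illustration. The parameters must lie strictly between 0 and 1 (for instance, smoothed estimates), since otherwise a logarithm of zero would be taken.

```python
import math


def predict_naive_bayes(x, pi, theta):
    """Return (posterior over classes, most probable class) for one binary vector x.

    pi[c] is the class prior and theta[c][j] = p(x_j = 1 | y = c).
    """
    log_joint = []
    for c in range(len(pi)):
        total = math.log(pi[c])
        for j, value in enumerate(x):
            p_on = theta[c][j]
            total += math.log(p_on) if value == 1 else math.log(1 - p_on)
        log_joint.append(total)
    # Normalize in the log domain using the log-sum-exp trick.
    big = max(log_joint)
    log_norm = big + math.log(sum(math.exp(v - big) for v in log_joint))
    posterior = [math.exp(v - log_norm) for v in log_joint]
    label = max(range(len(posterior)), key=posterior.__getitem__)
    return posterior, label


pi = [0.5, 0.5]
theta = [[0.75, 0.5], [0.5, 0.5]]
posterior, label = predict_naive_bayes([1, 0], pi, theta)
print(posterior, label)  # posterior close to [0.6, 0.4], label 0
```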


Algorithm 3.2: Predicting with a naive Bayes classifier for binary features
    for i = 1 : N do
        for c = 1 : C do
            L_ic = log \hat{\pi}_c;
            for j = 1 : D do
                if x_ij = 1 then L_ic := L_ic + log \hat{\theta}_jc else L_ic := L_ic + log(1 - \hat{\theta}_jc)
        p_ic = exp(L_ic - logsumexp(L_i,:));
        \hat{y}_i = argmax_c p_ic;


3.5.4 Feature selection using mutual information


Since an NBC is fitting a joint distribution over potentially many features, it can suffer from
overfitting. In addition, the run-time cost is O(D), which may be too high for some applications.

One common approach to tackling both of these problems is to perform feature selection, to
remove “irrelevant” features that do not help much with the classification problem. The simplest
approach to feature selection is to evaluate the relevance of each feature separately, and then
take the top K, where K is chosen based on some tradeoff between accuracy and complexity.
This approach is known as variable ranking, filtering, or screening.

One way to measure relevance is to use mutual information (Section 2.8.3) between feature
$X_j$ and the class label Y:

$$I(X, Y) = \sum_{x_j}\sum_y p(x_j, y)\log\frac{p(x_j, y)}{p(x_j)\,p(y)} \tag{3.75}$$

The mutual information can be thought of as the reduction in entropy on the label distribution
once we observe the value of feature j. If the features are binary, it is easy to show (Exercise 3.21)
that the MI can be computed as follows

$$I_j = \sum_c \left[\theta_{jc}\pi_c \log\frac{\theta_{jc}}{\theta_j} + (1 - \theta_{jc})\pi_c \log\frac{1 - \theta_{jc}}{1 - \theta_j}\right] \tag{3.76}$$

where $\pi_c = p(y = c)$, $\theta_{jc} = p(x_j = 1|y = c)$, and $\theta_j = p(x_j = 1) = \sum_c \pi_c\theta_{jc}$. (All of these
quantities can be computed as a by-product of fitting a naive Bayes classifier.)
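
The binary-feature formula translates directly into code. The sketch below is illustrative only: the function name and the example parameter values are ours, natural logarithms are used (so the result is in nats), the convention 0 log 0 = 0 is applied, and every class prior is assumed to be positive.

```python
import math


def mutual_information_binary(pi, theta_j):
    """MI between a binary feature and the class label.

    pi[c] = p(y = c) and theta_j[c] = p(x_j = 1 | y = c).
    """
    theta = sum(p * t for p, t in zip(pi, theta_j))  # marginal p(x_j = 1)

    def term(cond, marg):
        # Uses the convention 0 * log(0) = 0.
        return 0.0 if cond == 0 else cond * math.log(cond / marg)

    return sum(p * (term(t, theta) + term(1 - t, 1 - theta))
               for p, t in zip(pi, theta_j))


pi = [0.5, 0.5]
print(mutual_information_binary(pi, [0.998, 0.998]))  # equally likely in both classes: 0
print(mutual_information_binary(pi, [0.9, 0.1]))      # discriminative feature: about 0.37
```

A word such as “subject”, which is equally probable under both classes, scores zero no matter how frequent it is, which is why ranking by MI differs from ranking by probability.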

Table 3.1 illustrates what happens if we apply this to the binary bag of words dataset used in
Figure 3.8. We see that the words with highest mutual information are much more discriminative
than the words which are most probable. For example, the most probable word in both classes 
is “subject”, which always occurs because this is newsgroup data, which always has a subject 
line. But obviously this is not very discriminative. The words with highest MI with the class 
label are (in decreasing order) “windows”, “microsoft”, “DOS” and “motif”, which makes sense, 
since the classes correspond to Microsoft Windows and X Windows. 


3.5.5 Classifying documents using bag of words


Document classification is the problem of classifying text documents into different categories.
One simple approach is to represent each document as a binary vector, which records whether
each word is present or not, so $x_{ij} = 1$ iff word j occurs in document i, otherwise $x_{ij} = 0$.
We can then use the following class conditional density:

$$p(x_i|y_i = c, \theta) = \prod_{j=1}^D \mathrm{Ber}(x_{ij}|\theta_{jc}) = \prod_{j=1}^D \theta_{jc}^{\mathbb{I}(x_{ij} = 1)}(1 - \theta_{jc})^{\mathbb{I}(x_{ij} = 0)} \tag{3.77}$$

class 1   prob  | class 2   prob  | highest MI   MI
subject   0.998 | subject   0.998 | windows      0.215
this      0.628 | windows   0.639 | microsoft    0.095
with      0.535 | this      0.540 | dos          0.092
but       0.471 | with      0.538 | motif        0.078
you       0.431 | but       0.518 | window       0.067

Table 3.1 We list the 5 most likely words for class 1 (X windows) and class 2 (MS windows). We also show
the 5 words with highest mutual information with class label. Produced by naiveBayesBowDemo.


This is called the Bernoulli product model, or the binary independence model. 

However, ignoring the number of times each word occurs in a document loses some
information (McCallum and Nigam 1998). A more accurate representation counts the number
of occurrences of each word. Specifically, let $x_i$ be a vector of counts for document i, so
$x_{ij} \in \{0, 1, \ldots, N_i\}$, where $N_i$ is the number of terms in document i (so $\sum_{j=1}^D x_{ij} = N_i$). For
the class conditional densities, we can use a multinomial distribution:

$$p(x_i|y_i = c, \theta) = \mathrm{Mu}(x_i|N_i, \theta_c) = \frac{N_i!}{\prod_{j=1}^D x_{ij}!}\prod_{j=1}^D \theta_{jc}^{x_{ij}} \tag{3.78}$$

where we have implicitly assumed that the document length $N_i$ is independent of the class.
Here $\theta_{jc}$ is the probability of generating word j in documents of class c; these parameters satisfy
the constraint that $\sum_{j=1}^D \theta_{jc} = 1$ for each class c (see footnote 3).

Although the multinomial classifier is easy to train and easy to use at test time, it does not
work particularly well for document classification. One reason for this is that it does not take
into account the burstiness of word usage. This refers to the phenomenon that most words
never appear in any given document, but if they do appear once, they are likely to appear more
than once, i.e., words occur in bursts.

The multinomial model cannot capture the burstiness phenomenon. To see why, note that
Equation 3.78 has the form $\theta_{jc}^{x_{ij}}$, and since $\theta_{jc} \ll 1$ for rare words, it becomes increasingly
unlikely to generate many of them. For more frequent words, the decay rate is not as fast. To
see why intuitively, note that the most frequent words are function words which are not specific
to the class, such as “and”, “the”, and “but”; the chance of the word “and” occurring is pretty
much the same no matter how many times it has previously occurred (modulo document length),
so the independence assumption is more reasonable for common words. However, since rare
words are the ones that matter most for classification purposes, these are the ones we want to
model the most carefully.

Various ad hoc heuristics have been proposed to improve the performance of the multinomial 
document classifier (Rennie et al. 2003). We now present an alternative class conditional density 
that performs as well as these ad hoc methods, yet is probabilistically sound (Madsen et al. 
2005). 


3. Since Equation 3.78 models each word independently, this model is often called a naive Bayes classifier, although
technically the features $x_{ij}$ are not independent, because of the constraint $\sum_j x_{ij} = N_i$.

Suppose we simply replace the multinomial class conditional density with the Dirichlet
Compound Multinomial or DCM density, defined as follows:

$$p(x_i|y_i = c, \alpha) = \int \mathrm{Mu}(x_i|N_i, \theta_c)\,\mathrm{Dir}(\theta_c|\alpha_c)\,d\theta_c = \frac{N_i!}{\prod_{j=1}^D x_{ij}!}\,\frac{B(x_i + \alpha_c)}{B(\alpha_c)} \tag{3.79}$$


(This equation is derived in Equation 5.24.) Surprisingly this simple change is all that is needed
to capture the burstiness phenomenon. The intuitive reason for this is as follows: After seeing
one occurrence of a word, say word j, the posterior counts on $\theta_j$ get updated, making another
occurrence of word j more likely. By contrast, if $\theta_j$ is fixed, then the occurrences of each word are
independent. The multinomial model corresponds to drawing a ball from an urn with K colors
of ball, recording its color, and then replacing it. By contrast, the DCM model corresponds to
drawing a ball, recording its color, and then replacing it with one additional copy; this is called
the Polya urn.
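
The urn picture can be turned into a small calculation. Under a Dirichlet prior with pseudo counts α, the probability that the first word is j is $\alpha_j/\alpha_0$, and after observing word j once the predictive probability of seeing it again is $(\alpha_j + 1)/(\alpha_0 + 1)$, by the same posterior predictive rule as Equation 3.51. The sketch below uses hypothetical function names and a made-up two-word vocabulary in which word 0 is rare.

```python
def first_occurrence_prob(alpha, j):
    """Probability that the first word drawn is j under a Dir(alpha) prior."""
    return alpha[j] / sum(alpha)


def repeat_prob_dcm(alpha, j):
    """Probability of drawing word j again after observing it once (Polya urn)."""
    return (alpha[j] + 1) / (sum(alpha) + 1)


def repeat_prob_multinomial(theta, j):
    """With fixed parameters, a repeat is as likely as a first occurrence."""
    return theta[j]


alpha = [0.01, 0.99]  # word 0 is rare
print(first_occurrence_prob(alpha, 0))   # about 0.01
print(repeat_prob_dcm(alpha, 0))         # about 0.505: a burst is now likely
print(repeat_prob_multinomial([0.01, 0.99], 0))  # still 0.01
```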

Using the DCM as the class conditional density gives much better results than using the 
multinomial, and has performance comparable to state of the art methods, as described in 
(Madsen et al. 2005). The only disadvantage is that fitting the DCM model is more complex; see 
(Minka 2000e; Elkan 2006) for the details. 


Exercises 


Exercise 3.1 MLE for the Bernoulli/binomial model
Derive Equation 3.22 by optimizing the log of the likelihood in Equation 3.11. 


Exercise 3.2 Marginal likelihood for the Beta-Bernoulli model 


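In Equation 5.23, we showed that the marginal likelihood is the ratio of the normalizing constants:

$$p(\mathcal{D}) = \frac{Z(\alpha_1 + N_1, \alpha_0 + N_0)}{Z(\alpha_1, \alpha_0)} = \frac{\Gamma(\alpha_1 + N_1)\Gamma(\alpha_0 + N_0)}{\Gamma(\alpha_1 + \alpha_0 + N)}\,\frac{\Gamma(\alpha_1 + \alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_0)} \tag{3.80}$$

We will now give an alternative derivation of this fact. By the chain rule of probability,

$$p(x_{1:N}) = p(x_1)\,p(x_2|x_1)\,p(x_3|x_{1:2}) \cdots \tag{3.81}$$

In Section 3.3.4, we showed that the posterior predictive distribution is

$$p(X = k|\mathcal{D}_{1:N}) = \frac{N_k + \alpha_k}{\sum_i (N_i + \alpha_i)} \triangleq \frac{N_k + \alpha_k}{N + \alpha} \tag{3.82}$$

where $k \in \{0, 1\}$ and $\mathcal{D}_{1:N}$ is the data seen so far. Now suppose $\mathcal{D}$ = H, T, T, H, H or $\mathcal{D}$ = 1, 0, 0, 1, 1.
Then

$$p(\mathcal{D}) = \frac{\alpha_1}{\alpha}\cdot\frac{\alpha_0}{\alpha + 1}\cdot\frac{\alpha_0 + 1}{\alpha + 2}\cdot\frac{\alpha_1 + 1}{\alpha + 3}\cdot\frac{\alpha_1 + 2}{\alpha + 4} \tag{3.83}$$
$$= \frac{[\alpha_1(\alpha_1 + 1)(\alpha_1 + 2)]\,[\alpha_0(\alpha_0 + 1)]}{\alpha(\alpha + 1)\cdots(\alpha + 4)} \tag{3.84}$$
$$= \frac{[(\alpha_1)\cdots(\alpha_1 + N_1 - 1)]\,[(\alpha_0)\cdots(\alpha_0 + N_0 - 1)]}{(\alpha)\cdots(\alpha + N - 1)} \tag{3.85}$$

Show how this reduces to Equation 3.80 by using the fact that, for integers, $(\alpha - 1)! = \Gamma(\alpha)$.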

Exercise 3.3 Posterior predictive for Beta-Binomial model


Recall from Equation 3.32 that the posterior predictive for the Beta-Binomial is given by

$$p(x|n, \mathcal{D}) = \mathrm{Bb}(x|\alpha_1', \alpha_0', n) \tag{3.86}$$
$$= \frac{B(x + \alpha_1', n - x + \alpha_0')}{B(\alpha_1', \alpha_0')}\binom{n}{x} \tag{3.87}$$

Prove that this reduces to

$$p(\tilde{x} = 1|\mathcal{D}) = \frac{\alpha_1'}{\alpha_0' + \alpha_1'} \tag{3.88}$$

when n = 1 (and hence $x \in \{0, 1\}$), i.e., show that

$$\mathrm{Bb}(1|\alpha_1', \alpha_0', 1) = \frac{\alpha_1'}{\alpha_1' + \alpha_0'} \tag{3.89}$$

Hint: use the fact that

$$\Gamma(\alpha_0 + \alpha_1 + 1) = (\alpha_0 + \alpha_1)\,\Gamma(\alpha_0 + \alpha_1) \tag{3.90}$$


Exercise 3.4 Beta updating from censored likelihood 

(Source: Gelman.) Suppose we toss a coin n = 5 times. Let X be the number of heads. We observe that
there are fewer than 3 heads, but we don’t know exactly how many. Let the prior probability of heads be
$p(\theta) = \mathrm{Beta}(\theta|1, 1)$. Compute the posterior $p(\theta|X < 3)$ up to normalization constants, i.e., derive an
expression proportional to $p(\theta, X < 3)$. Hint: the answer is a mixture distribution.

Exercise 3.5 Uninformative prior for log-odds ratio 

Let

$$\phi = \mathrm{logit}(\theta) = \log\frac{\theta}{1 - \theta} \tag{3.91}$$

Show that if $p(\phi) \propto 1$, then $p(\theta) \propto \mathrm{Beta}(\theta|0, 0)$. Hint: use the change of variables formula.


Exercise 3.6 MLE for the Poisson distribution 


The Poisson pmf is defined as $\mathrm{Poi}(x|\lambda) = e^{-\lambda}\frac{\lambda^x}{x!}$, for $x \in \{0, 1, 2, \ldots\}$, where $\lambda > 0$ is the rate
parameter. Derive the MLE.


Exercise 3.7 Bayesian analysis of the Poisson distribution 

In Exercise 3.6, we defined the Poisson distribution with rate $\lambda$ and derived its MLE. Here we perform a
conjugate Bayesian analysis.

a. Derive the posterior $p(\lambda|\mathcal{D})$ assuming a conjugate prior $p(\lambda) = \mathrm{Ga}(\lambda|a, b) \propto \lambda^{a-1}e^{-\lambda b}$. Hint: the
posterior is also a Gamma distribution.

b. What does the posterior mean tend to as $a \to 0$ and $b \to 0$? (Recall that the mean of a Ga(a, b)
distribution is a/b.)

Exercise 3.8 MLE for the uniform distribution 


(Source: Kaelbling.) Consider a uniform distribution centered on 0 with width $2a$. The density function is
given by

$$p(x) = \frac{1}{2a}\,\mathbb{I}(x \in [-a,a])$$ (3.92)

a. Given a data set $x_1,\ldots,x_n$, what is the maximum likelihood estimate of $a$ (call it $\hat{a}$)?

b. What probability would the model assign to a new data point $x_{n+1}$ using $\hat{a}$?

c. Do you see any problem with the above approach? Briefly suggest (in words) a better approach.


Exercise 3.9 Bayesian analysis of the uniform distribution 


Consider the uniform distribution $\mathrm{Unif}(0,\theta)$. The maximum likelihood estimate is $\hat{\theta} = \max(\mathcal{D})$, as we
saw in Exercise 3.8, but this is unsuitable for predicting future data since it puts zero probability mass
outside the training data. In this exercise, we will perform a Bayesian analysis of the uniform distribution
(following (Minka 2001a)). The conjugate prior is the Pareto distribution, $p(\theta) = \mathrm{Pareto}(\theta|b,K)$, defined in
Section 2.4.6. Given a Pareto prior, the joint distribution of $\theta$ and $\mathcal{D} = (x_1,\ldots,x_N)$ is

$$p(\mathcal{D},\theta) = \frac{K b^K}{\theta^{N+K+1}}\,\mathbb{I}(\theta \ge \max(\mathcal{D}, b))$$ (3.93)

Let $m = \max(\mathcal{D})$. The evidence (the probability that all $N$ samples came from the same uniform
distribution) is

$$p(\mathcal{D}) = \int_{\max(m,b)}^{\infty} \frac{K b^K}{\theta^{N+K+1}}\,d\theta$$ (3.94)

$$= \begin{cases} \dfrac{K}{(N+K)\,b^N} & \text{if } m \le b \\[2ex] \dfrac{K b^K}{(N+K)\,m^{N+K}} & \text{if } m > b \end{cases}$$ (3.95)

Derive the posterior $p(\theta|\mathcal{D})$, and show that it can be expressed as a Pareto distribution.


Exercise 3.10 Taxicab (tramcar) problem 


Suppose you arrive in a new city and see a taxi numbered 100. How many taxis are there in this city? Let
us assume taxis are numbered sequentially as integers starting from 0, up to some unknown upper bound
$\theta$. (We number taxis from 0 for simplicity; we can also count from 1 without changing the analysis.) Hence
the likelihood function is $p(x) = U(0,\theta)$, the uniform distribution. The goal is to estimate $\theta$. We will use
the Bayesian analysis from Exercise 3.9.

a. Suppose we see one taxi numbered 100, so $\mathcal{D} = \{100\}$, $m = 100$, $N = 1$. Using an (improper)
non-informative prior on $\theta$ of the form $p(\theta) = \mathrm{Pa}(\theta|0,0) \propto 1/\theta$, what is the posterior $p(\theta|\mathcal{D})$?

b. Compute the posterior mean, mode and median number of taxis in the city, if such quantities exist.

c. Rather than trying to compute a point estimate of the number of taxis, we can compute the predictive
density over the next taxicab number using

$$p(\mathcal{D}'|\mathcal{D},\boldsymbol{\alpha}) = \int p(\mathcal{D}'|\theta)\,p(\theta|\mathcal{D},\boldsymbol{\alpha})\,d\theta = p(\mathcal{D}'|\boldsymbol{\beta})$$ (3.96)

where $\boldsymbol{\alpha} = (b,K)$ are the hyper-parameters and $\boldsymbol{\beta} = (c, N+K)$ are the updated hyper-parameters. Now
consider the case $\mathcal{D} = \{m\}$ and $\mathcal{D}' = \{x\}$. Using Equation 3.95, write down an expression for

$$p(x|\mathcal{D},\boldsymbol{\alpha})$$ (3.97)

As above, use a non-informative prior $b = K = 0$.

d. Use the predictive density formula to compute the probability that the next taxi you will see (say,
the next day) has number 100, 50 or 150, i.e., compute $p(x=100|\mathcal{D},\boldsymbol{\alpha})$, $p(x=50|\mathcal{D},\boldsymbol{\alpha})$, $p(x=150|\mathcal{D},\boldsymbol{\alpha})$.

e. Briefly describe (1-2 sentences) some ways we might make the model more accurate at prediction.

Exercise 3.11 Bayesian analysis of the exponential distribution

A lifetime $X$ of a machine is modeled by an exponential distribution with unknown parameter $\theta$. The
likelihood is $p(x|\theta) = \theta e^{-\theta x}$ for $x \ge 0$, $\theta > 0$.

a. Show that the MLE is $\hat{\theta} = 1/\bar{x}$, where $\bar{x} = \frac{1}{N}\sum_{i=1}^N x_i$.

b. Suppose we observe $X_1 = 5$, $X_2 = 6$, $X_3 = 4$ (the lifetimes (in years) of 3 different iid machines).
What is the MLE given this data?

c. Assume that an expert believes $\theta$ should have a prior distribution that is also exponential

$$p(\theta) = \mathrm{Expon}(\theta|\lambda)$$ (3.98)

Choose the prior parameter, call it $\hat{\lambda}$, such that $\mathbb{E}[\theta] = 1/3$. Hint: recall that the Gamma distribution
has the form

$$\mathrm{Ga}(\theta|a,b) \propto \theta^{a-1}e^{-\theta b}$$ (3.99)

and its mean is $a/b$.

d. What is the posterior, $p(\theta|\mathcal{D},\hat{\lambda})$?

e. Is the exponential prior conjugate to the exponential likelihood?

f. What is the posterior mean, $\mathbb{E}[\theta|\mathcal{D},\hat{\lambda}]$?

g. Explain why the MLE and posterior mean differ. Which is more reasonable in this example?


Exercise 3.12 MAP estimation for the Bernoulli with non-conjugate priors 
(Source: Jaakkola.) In the book, we discussed Bayesian inference of a Bernoulli rate parameter with the
prior $p(\theta) = \mathrm{Beta}(\theta|\alpha,\beta)$. We know that, with this prior, the MAP estimate is given by

$$\hat{\theta} = \frac{N_1+\alpha-1}{N+\alpha+\beta-2}$$ (3.100)

where $N_1$ is the number of heads, $N_0$ is the number of tails, and $N = N_0 + N_1$ is the total number of
trials.

a. Now consider the following prior, that believes the coin is fair, or is slightly biased towards tails:

$$p(\theta) = \begin{cases} 0.5 & \text{if } \theta = 0.5 \\ 0.5 & \text{if } \theta = 0.4 \\ 0 & \text{otherwise} \end{cases}$$ (3.101)

Derive the MAP estimate under this prior as a function of $N_1$ and $N$.

b. Suppose the true parameter is $\theta = 0.41$. Which prior leads to a better estimate when $N$ is small?
Which prior leads to a better estimate when $N$ is large?


Exercise 3.13 Posterior predictive distribution for a batch of data with the Dirichlet-multinomial model

In Equation 3.51, we gave the posterior predictive distribution for a single multinomial trial using a
Dirichlet prior. Now consider predicting a batch of new data, $\tilde{\mathcal{D}} = (X_1,\ldots,X_m)$, consisting of $m$ single
multinomial trials (think of predicting the next $m$ words in a sentence, assuming they are drawn iid).
Derive an expression for

$$p(\tilde{\mathcal{D}}|\mathcal{D},\boldsymbol{\alpha})$$ (3.102)

Your answer should be a function of $\boldsymbol{\alpha}$, and the old and new counts (sufficient statistics), defined as

$$N_k^{\mathrm{old}} = \sum_{i \in \mathcal{D}} \mathbb{I}(x_i = k)$$ (3.103)

$$N_k^{\mathrm{new}} = \sum_{i \in \tilde{\mathcal{D}}} \mathbb{I}(x_i = k)$$ (3.104)

Hint: recall that, for a vector of counts, $N_{1:K}$, the marginal likelihood (evidence) is given by

$$p(\mathcal{D}|\boldsymbol{\alpha}) = \frac{\Gamma(\alpha)}{\Gamma(N+\alpha)} \prod_k \frac{\Gamma(N_k+\alpha_k)}{\Gamma(\alpha_k)}$$ (3.105)

where $\alpha = \sum_k \alpha_k$ and $N = \sum_k N_k$.


Exercise 3.14 Posterior predictive for Dirichlet-multinomial 


(Source: Koller.)

a. Suppose we compute the empirical distribution over letters of the Roman alphabet plus the space
character (a distribution over 27 values) from 2000 samples. Suppose we see the letter "e" 260 times.
What is $p(x_{2001} = e|\mathcal{D})$, if we assume $\boldsymbol{\theta} \sim \mathrm{Dir}(\alpha_1,\ldots,\alpha_{27})$, where $\alpha_k = 10$ for all $k$?

b. Suppose, in the 2000 samples, we saw "e" 260 times, "a" 100 times, and "p" 87 times. What is
$p(x_{2001} = p, x_{2002} = a|\mathcal{D})$, if we assume $\boldsymbol{\theta} \sim \mathrm{Dir}(\alpha_1,\ldots,\alpha_{27})$, where $\alpha_k = 10$ for all $k$? Show
your work.


Exercise 3.15 Setting the beta hyper-parameters 


Suppose $\theta \sim \beta(\alpha_1,\alpha_2)$ and we believe that $\mathbb{E}[\theta] = m$ and $\mathrm{var}[\theta] = v$. Using Equation 2.62, solve for
$\alpha_1$ and $\alpha_2$ in terms of $m$ and $v$. What values do you get if $m = 0.7$ and $v = 0.2^2$?


Exercise 3.16 Setting the beta hyper-parameters II 


(Source: Draper.) Suppose $\theta \sim \beta(\alpha_1,\alpha_2)$ and we believe that $\mathbb{E}[\theta] = m$ and $p(\ell < \theta < u) = 0.95$.
Write a program that can solve for $\alpha_1$ and $\alpha_2$ in terms of $m$, $\ell$ and $u$. Hint: write $\alpha_2$ as a function of $\alpha_1$
and $m$, so the pdf only has one unknown; then write down the probability mass contained in the interval
as an integral, and minimize its squared discrepancy from 0.95. What values do you get if $m = 0.15$,
$\ell = 0.05$ and $u = 0.3$? What is the equivalent sample size of this prior?


Exercise 3.17 Marginal likelihood for beta-binomial under uniform prior 

Suppose we toss a coin $N$ times and observe $N_1$ heads. Let $N_1 \sim \mathrm{Bin}(N,\theta)$ and $\theta \sim \mathrm{Beta}(1,1)$. Show
that the marginal likelihood is $p(N_1|N) = 1/(N+1)$. Hint: $\Gamma(x+1) = x!$ if $x$ is an integer.
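
As a numerical sanity check on this claim (not a substitute for the derivation), the sketch below evaluates the marginal likelihood in the closed form $\binom{N}{N_1} B(N_1+1, N-N_1+1)$ using the log-gamma function from Python's standard library. The function names `log_beta` and `marginal_likelihood` are illustrative assumptions, not code from the book.

```python
import math


def log_beta(a, b):
    """Log of the Beta function, B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def marginal_likelihood(n1, n):
    """p(N1 | N) under a Beta(1, 1) prior: C(N, N1) * B(N1 + 1, N - N1 + 1)."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(n1 + 1)
                  - math.lgamma(n - n1 + 1))
    return math.exp(log_choose + log_beta(n1 + 1, n - n1 + 1))


if __name__ == "__main__":
    n = 10
    for n1 in range(n + 1):
        # Every value should agree with 1 / (N + 1), whatever N1 is.
        print(n1, marginal_likelihood(n1, n), 1.0 / (n + 1))
```

Working in log space avoids overflow in the factorials when $N$ is large.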

Exercise 3.18 Bayes factor for coin tossing 


Suppose we toss a coin $N = 10$ times and observe $N_1 = 9$ heads. Let the null hypothesis be that the
coin is fair, and the alternative be that the coin can have any bias, so $p(\theta) = \mathrm{Unif}(0,1)$. Derive the
Bayes factor $BF_{1,0}$ in favor of the biased coin hypothesis. What if $N = 100$ and $N_1 = 90$? Hint: see
Exercise 3.17.


Exercise 3.19 Irrelevant features with naive Bayes 


(Source: Jaakkola.) Let $x_{iw} = 1$ if word $w$ occurs in document $i$ and $x_{iw} = 0$ otherwise. Let $\theta_{cw}$ be the
estimated probability that word $w$ occurs in documents of class $c$. Then the log-likelihood that document
$\mathbf{x}_i$ belongs to class $c$ is

$$\log p(\mathbf{x}_i|c,\boldsymbol{\theta}) = \log \prod_{w=1}^{W} \theta_{cw}^{x_{iw}} (1-\theta_{cw})^{1-x_{iw}}$$ (3.106)

$$= \sum_{w=1}^{W} x_{iw}\log\theta_{cw} + (1-x_{iw})\log(1-\theta_{cw})$$ (3.107)

$$= \sum_{w=1}^{W} x_{iw}\log\frac{\theta_{cw}}{1-\theta_{cw}} + \sum_{w}\log(1-\theta_{cw})$$ (3.108)

where $W$ is the number of words in the vocabulary. We can write this more succinctly as

$$\log p(\mathbf{x}_i|c,\boldsymbol{\theta}) = \boldsymbol{\phi}(\mathbf{x}_i)^T\boldsymbol{\beta}_c$$ (3.109)

where $\mathbf{x}_i = (x_{i1},\ldots,x_{iW})$ is a bit vector, $\boldsymbol{\phi}(\mathbf{x}_i) = (\mathbf{x}_i, 1)$, and

$$\boldsymbol{\beta}_c = \left(\log\frac{\theta_{c1}}{1-\theta_{c1}},\ldots,\log\frac{\theta_{cW}}{1-\theta_{cW}},\ \sum_w \log(1-\theta_{cw})\right)^T$$ (3.110)

We see that this is a linear classifier, since the log class-conditional density is a linear function (an inner
product) of the parameters $\boldsymbol{\beta}_c$.

a. Assuming $p(C=1) = p(C=2) = 0.5$, write down an expression for the log posterior odds ratio,
$\log_2 \frac{p(c=1|\mathbf{x}_i)}{p(c=2|\mathbf{x}_i)}$, in terms of the features $\boldsymbol{\phi}(\mathbf{x}_i)$ and the parameters $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$.

b. Intuitively, words that occur in both classes are not very "discriminative", and therefore should not
affect our beliefs about the class label. Consider a particular word $w$. State the conditions on $\theta_{1,w}$ and
$\theta_{2,w}$ (or equivalently the conditions on $\beta_{1,w}, \beta_{2,w}$) under which the presence or absence of $w$ in a
test document will have no effect on the class posterior (such a word will be ignored by the classifier).
Hint: using your previous result, figure out when the posterior odds ratio is 0.5/0.5.

c. The posterior mean estimate of $\theta$, using a Beta(1,1) prior, is given by

$$\hat{\theta}_{cw} = \frac{1 + \sum_{i \in c} x_{iw}}{2 + n_c}$$ (3.111)

where the sum is over the $n_c$ documents in class $c$. Consider a particular word $w$, and suppose it
always occurs in every document (regardless of class). Let there be $n_1$ documents of class 1 and $n_2$
documents of class 2, where $n_1 \ne n_2$ (since, e.g., we get much more non-spam than
spam; this is an example of class imbalance). If we use the above estimate for $\theta_{cw}$, will word $w$ be
ignored by our classifier? Explain why or why not.

d. What other ways can you think of which encourage "irrelevant" words to be ignored?


Exercise 3.20 Class conditional densities for binary data 


Consider a generative classifier for $C$ classes with class conditional density $p(\mathbf{x}|y)$ and uniform class prior
$p(y)$. Suppose all the $D$ features are binary, $x_j \in \{0,1\}$. If we assume all the features are conditionally
independent (the naive Bayes assumption), we can write

$$p(\mathbf{x}|y=c) = \prod_{j=1}^{D} \mathrm{Ber}(x_j|\theta_{jc})$$ (3.112)

This requires $DC$ parameters.

a. Now consider a different model, which we will call the "full" model, in which all the features are fully
dependent (i.e., we make no factorization assumptions). How might we represent $p(\mathbf{x}|y=c)$ in this
case? How many parameters are needed to represent $p(\mathbf{x}|y=c)$?


b. Assume the number of features D is fixed. Let there be N training cases. If the sample size N is very 
small, which model (naive Bayes or full) is likely to give lower test set error, and why? 


c. If the sample size N is very large, which model (naive Bayes or full) is likely to give lower test set error, 
and why? 


d. What is the computational complexity of fitting the full and naive Bayes models as a function of N 
and D? Use big-Oh notation. (Fitting the model here means computing the MLE or MAP parameter 
estimates. You may assume you can convert a D-bit vector to an array index in O(D) time.) 


e. What is the computational complexity of applying the full and naive Bayes models at test time to a 
single test case? 


f. Suppose the test case has missing data. Let $\mathbf{x}_v$ be the visible features of size $v$, and $\mathbf{x}_h$ be the hidden
(missing) features of size $h$, where $v + h = D$. What is the computational complexity of computing
$p(y|\mathbf{x}_v,\hat{\boldsymbol{\theta}})$ for the full and naive Bayes models, as a function of $v$ and $h$?


Exercise 3.21 Mutual information for naive Bayes classifiers with binary features 


Derive Equation 3.76. 


Exercise 3.22 Fitting a naive Bayes spam filter by hand

(Source: Daphne Koller.) Consider a naive Bayes model (multivariate Bernoulli version) for spam classification
with the vocabulary V = "secret", "offer", "low", "price", "valued", "customer", "today", "dollar", "million",
"sports", "is", "for", "play", "healthy", "pizza". We have the following example spam messages: "million dollar
offer", "secret offer today", "secret is secret"; and normal messages: "low price for valued customer", "play
secret sports today", "sports is healthy", "low price pizza". Give the MLEs for the following parameters:
$\theta_{\mathrm{spam}}$, $\theta_{\mathrm{secret|spam}}$, $\theta_{\mathrm{secret|non\text{-}spam}}$, $\theta_{\mathrm{sports|non\text{-}spam}}$, $\theta_{\mathrm{dollar|spam}}$.


4 Gaussian models


4.1 Introduction

In this chapter, we discuss the multivariate Gaussian or multivariate normal (MVN), which
is the most widely used joint probability density function for continuous variables. It will form
the basis for many of the models we will encounter in later chapters.

Unfortunately, the level of mathematics in this chapter is higher than in many other chapters.
In particular, we rely heavily on linear algebra and matrix calculus. This is the price one must
pay in order to deal with high-dimensional data. Beginners may choose to skip sections marked
with a *. In addition, since there are so many equations in this chapter, we have put a box
around those that are particularly important.


4.1.1 Notation

Let us briefly say a few words about notation. We denote vectors by boldface lower case letters,
such as $\mathbf{x}$. We denote matrices by boldface upper case letters, such as $\mathbf{X}$. We denote entries in
a matrix by non-bold upper case letters, such as $X_{ij}$.

All vectors are assumed to be column vectors unless noted otherwise. We use $[x_1,\ldots,x_D]$ to
denote a column vector created by stacking $D$ scalars. Similarly, if we write $\mathbf{x} = [\mathbf{x}_1,\ldots,\mathbf{x}_D]$,
where the left hand side is a tall column vector, we mean to stack the $\mathbf{x}_i$ along the rows; this is
usually written as $\mathbf{x} = (\mathbf{x}_1^T,\ldots,\mathbf{x}_D^T)^T$, but that is rather ugly. If we write $\mathbf{X} = [\mathbf{x}_1,\ldots,\mathbf{x}_D]$,
where the left hand side is a matrix, we mean to stack the $\mathbf{x}_i$ along the columns, creating a
matrix.


4.1.2 Basics


Recall from Section 2.5.2 that the pdf for an MVN in $D$ dimensions is defined by the following:

$$\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma}) \triangleq \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]$$ (4.1)

Figure 4.1 Visualization of a 2 dimensional Gaussian density. The major and minor axes of the ellipse
are defined by the first two eigenvectors of the covariance matrix, namely $\mathbf{u}_1$ and $\mathbf{u}_2$. Based on Figure 2.7
of (Bishop 2006a).


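The expression inside the exponent is the Mahalanobis distance between a data vector $\mathbf{x}$
and the mean vector $\boldsymbol{\mu}$. We can gain a better understanding of this quantity by performing an
eigendecomposition of $\boldsymbol{\Sigma}$. That is, we write $\boldsymbol{\Sigma} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$, where $\mathbf{U}$ is an orthonormal matrix
of eigenvectors satisfying $\mathbf{U}^T\mathbf{U} = \mathbf{I}$, and $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues.

Using the eigendecomposition, we have that

$$\boldsymbol{\Sigma}^{-1} = \mathbf{U}^{-T}\boldsymbol{\Lambda}^{-1}\mathbf{U}^{-1} = \mathbf{U}\boldsymbol{\Lambda}^{-1}\mathbf{U}^T = \sum_{i=1}^{D}\frac{1}{\lambda_i}\mathbf{u}_i\mathbf{u}_i^T$$ (4.2)

where $\mathbf{u}_i$ is the $i$'th column of $\mathbf{U}$, containing the $i$'th eigenvector. Hence we can rewrite the
Mahalanobis distance as follows:

$$(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) = (\mathbf{x}-\boldsymbol{\mu})^T\left(\sum_{i=1}^{D}\frac{1}{\lambda_i}\mathbf{u}_i\mathbf{u}_i^T\right)(\mathbf{x}-\boldsymbol{\mu})$$ (4.3)

$$= \sum_{i=1}^{D}\frac{1}{\lambda_i}(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{u}_i\mathbf{u}_i^T(\mathbf{x}-\boldsymbol{\mu}) = \sum_{i=1}^{D}\frac{y_i^2}{\lambda_i}$$ (4.4)

where $y_i \triangleq \mathbf{u}_i^T(\mathbf{x}-\boldsymbol{\mu})$. Recall that the equation for an ellipse in 2d is

$$\frac{y_1^2}{\lambda_1} + \frac{y_2^2}{\lambda_2} = 1$$ (4.5)

Hence we see that the contours of equal probability density of a Gaussian lie along ellipses.
This is illustrated in Figure 4.1. The eigenvectors determine the orientation of the ellipse, and
the eigenvalues determine how elongated it is.

In general, we see that the Mahalanobis distance corresponds to Euclidean distance in a
transformed coordinate system, where we shift by $\boldsymbol{\mu}$ and rotate by $\mathbf{U}$.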
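
To make the change of coordinates concrete, here is a minimal sketch for the 2d case, assuming a covariance matrix of the form [[a, b], [b, d]]. It evaluates the quadratic form of Equation 4.4 through the eigendecomposition and compares it with a direct computation using the explicit 2x2 inverse. The function names and the closed-form 2x2 eigendecomposition are illustrative choices, not code from the book.

```python
import math


def eig_sym_2x2(a, b, d):
    """Eigendecomposition of the symmetric matrix [[a, b], [b, d]].

    Returns (lam1, lam2, u1, u2) with lam1 >= lam2 and orthonormal u1, u2.
    """
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    theta = 0.5 * math.atan2(2.0 * b, a - d)  # orientation of the major axis
    u1 = (math.cos(theta), math.sin(theta))
    u2 = (-math.sin(theta), math.cos(theta))
    return mean + radius, mean - radius, u1, u2


def quad_form_eig(x, mu, a, b, d):
    """(x - mu)^T Sigma^{-1} (x - mu) as sum_i y_i^2 / lambda_i (Equation 4.4)."""
    lam1, lam2, u1, u2 = eig_sym_2x2(a, b, d)
    dx = (x[0] - mu[0], x[1] - mu[1])
    y1 = u1[0] * dx[0] + u1[1] * dx[1]  # y_i = u_i^T (x - mu)
    y2 = u2[0] * dx[0] + u2[1] * dx[1]
    return y1 * y1 / lam1 + y2 * y2 / lam2


def quad_form_direct(x, mu, a, b, d):
    """Same quantity using the explicit inverse of the 2x2 covariance."""
    det = a * d - b * b
    dx = (x[0] - mu[0], x[1] - mu[1])
    return (d * dx[0] ** 2 - 2.0 * b * dx[0] * dx[1] + a * dx[1] ** 2) / det
```

The two functions should agree up to floating-point error for any positive definite covariance. With the identity covariance, both reduce to the squared Euclidean distance.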


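4.1.3 MLE for an MVN

We now describe one way to estimate the parameters of an MVN, using MLE. In later sections,
we will discuss Bayesian inference for the parameters, which can mitigate overfitting, and can
provide a measure of confidence in our estimates.

Theorem 4.1.1 (MLE for a Gaussian). If we have $N$ iid samples $\mathbf{x}_i \sim \mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$, then the MLE for
the parameters is given by

$$\hat{\boldsymbol{\mu}}_{mle} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i \triangleq \bar{\mathbf{x}}$$ (4.6)

$$\hat{\boldsymbol{\Sigma}}_{mle} = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T = \frac{1}{N}\left(\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T\right) - \bar{\mathbf{x}}\bar{\mathbf{x}}^T$$ (4.7)

That is, the MLE is just the empirical mean and empirical covariance. In the univariate case, we
get the following familiar results:

$$\hat{\mu} = \frac{1}{N}\sum_{i} x_i = \bar{x}$$ (4.8)

$$\hat{\sigma}^2 = \frac{1}{N}\sum_{i}(x_i-\bar{x})^2 = \left(\frac{1}{N}\sum_{i} x_i^2\right) - \bar{x}^2$$ (4.9)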
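
The estimators in Theorem 4.1.1 can be sketched in plain Python as follows. This is a minimal illustration that assumes the data are given as a list of equal-length lists; the function names are hypothetical. Note that the MLE divides by $N$, not $N-1$.

```python
def mvn_mle(data):
    """MLE of the mean and covariance (Equations 4.6 and 4.7, first form)."""
    n = len(data)
    dim = len(data[0])
    mean = [sum(x[j] for x in data) / n for j in range(dim)]
    cov = [[sum((x[j] - mean[j]) * (x[k] - mean[k]) for x in data) / n
            for k in range(dim)]
           for j in range(dim)]
    return mean, cov


def mvn_cov_from_moments(data):
    """Covariance via the second form of Equation 4.7:
    (1/N) sum_i x_i x_i^T - xbar xbar^T."""
    n = len(data)
    dim = len(data[0])
    mean = [sum(x[j] for x in data) / n for j in range(dim)]
    return [[sum(x[j] * x[k] for x in data) / n - mean[j] * mean[k]
             for k in range(dim)]
            for j in range(dim)]
```

Both forms give the same matrix in exact arithmetic. The centered (first) form is usually preferable numerically, since the second form subtracts two potentially large numbers.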


4.1.3.1 Proof *


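To prove this result, we will need several results from matrix algebra, which we summarize
below. In the equations, $\mathbf{a}$ and $\mathbf{b}$ are vectors, and $\mathbf{A}$ and $\mathbf{B}$ are matrices. Also, the notation
$\mathrm{tr}(\mathbf{A})$ refers to the trace of a matrix, which is the sum of its diagonals: $\mathrm{tr}(\mathbf{A}) = \sum_i A_{ii}$.

$$\begin{aligned}
\frac{\partial(\mathbf{b}^T\mathbf{a})}{\partial\mathbf{a}} &= \mathbf{b} \\
\frac{\partial(\mathbf{a}^T\mathbf{A}\mathbf{a})}{\partial\mathbf{a}} &= (\mathbf{A}+\mathbf{A}^T)\mathbf{a} \\
\frac{\partial}{\partial\mathbf{A}}\mathrm{tr}(\mathbf{B}\mathbf{A}) &= \mathbf{B}^T \\
\frac{\partial}{\partial\mathbf{A}}\log|\mathbf{A}| &= \mathbf{A}^{-T} \triangleq (\mathbf{A}^{-1})^T \\
\mathrm{tr}(\mathbf{A}\mathbf{B}\mathbf{C}) &= \mathrm{tr}(\mathbf{C}\mathbf{A}\mathbf{B}) = \mathrm{tr}(\mathbf{B}\mathbf{C}\mathbf{A})
\end{aligned}$$ (4.10)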


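The last equation is called the cyclic permutation property of the trace operator. Using this,
we can derive the widely used trace trick, which reorders the scalar inner product $\mathbf{x}^T\mathbf{A}\mathbf{x}$ as
follows

$$\mathbf{x}^T\mathbf{A}\mathbf{x} = \mathrm{tr}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = \mathrm{tr}(\mathbf{x}\mathbf{x}^T\mathbf{A}) = \mathrm{tr}(\mathbf{A}\mathbf{x}\mathbf{x}^T)$$ (4.11)

Proof. We can now begin with the proof. The log-likelihood is

$$\ell(\boldsymbol{\mu},\boldsymbol{\Sigma}) = \log p(\mathcal{D}|\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{N}{2}\log|\boldsymbol{\Lambda}| - \frac{1}{2}\sum_{i=1}^{N}(\mathbf{x}_i-\boldsymbol{\mu})^T\boldsymbol{\Lambda}(\mathbf{x}_i-\boldsymbol{\mu})$$ (4.12)

where $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$ is the precision matrix.

Using the substitution $\mathbf{y}_i = \mathbf{x}_i - \boldsymbol{\mu}$ and the chain rule of calculus, we have

$$\frac{\partial}{\partial\boldsymbol{\mu}}(\mathbf{x}_i-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}_i-\boldsymbol{\mu}) = \frac{\partial}{\partial\mathbf{y}_i}\mathbf{y}_i^T\boldsymbol{\Sigma}^{-1}\mathbf{y}_i\,\frac{\partial\mathbf{y}_i}{\partial\boldsymbol{\mu}}$$ (4.13)

$$= -1\,(\boldsymbol{\Sigma}^{-1}+\boldsymbol{\Sigma}^{-T})\mathbf{y}_i$$ (4.14)

Hence

$$\frac{\partial}{\partial\boldsymbol{\mu}}\ell(\boldsymbol{\mu},\boldsymbol{\Sigma}) = -\frac{1}{2}\sum_{i=1}^{N}-2\boldsymbol{\Sigma}^{-1}(\mathbf{x}_i-\boldsymbol{\mu}) = \boldsymbol{\Sigma}^{-1}\sum_{i=1}^{N}(\mathbf{x}_i-\boldsymbol{\mu}) = \mathbf{0}$$ (4.15)

$$\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i = \bar{\mathbf{x}}$$ (4.16)

So the MLE of $\boldsymbol{\mu}$ is just the empirical mean.

Now we can use the trace trick to rewrite the log-likelihood for $\boldsymbol{\Lambda}$ as follows:

$$\ell(\boldsymbol{\Lambda}) = \frac{N}{2}\log|\boldsymbol{\Lambda}| - \frac{1}{2}\sum_{i}\mathrm{tr}[(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})^T\boldsymbol{\Lambda}]$$ (4.17)

$$= \frac{N}{2}\log|\boldsymbol{\Lambda}| - \frac{1}{2}\mathrm{tr}[\mathbf{S}_{\boldsymbol{\mu}}\boldsymbol{\Lambda}]$$ (4.18)

where

$$\mathbf{S}_{\boldsymbol{\mu}} \triangleq \sum_{i=1}^{N}(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})^T$$ (4.20)

is the scatter matrix centered on $\boldsymbol{\mu}$. Taking derivatives of this expression with respect to $\boldsymbol{\Lambda}$
yields

$$\frac{\partial\ell(\boldsymbol{\Lambda})}{\partial\boldsymbol{\Lambda}} = \frac{N}{2}\boldsymbol{\Lambda}^{-T} - \frac{1}{2}\mathbf{S}_{\boldsymbol{\mu}}^T = \mathbf{0}$$ (4.21)

$$\boldsymbol{\Lambda}^{-T} = \boldsymbol{\Lambda}^{-1} = \boldsymbol{\Sigma} = \frac{1}{N}\mathbf{S}_{\boldsymbol{\mu}}$$ (4.22)

so

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})^T$$ (4.23)

which is just the empirical covariance matrix centered on $\boldsymbol{\mu}$. If we plug in the MLE $\boldsymbol{\mu} = \bar{\mathbf{x}}$
(since both parameters must be simultaneously optimized), we get the standard equation for the
MLE of a covariance matrix.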
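
The trace trick of Equation 4.11, on which the proof relies, can be checked numerically. The following sketch uses small hand-written helpers (all names are illustrative assumptions) rather than a linear algebra library, so that each step is explicit.

```python
def quad_form(x, A):
    """Scalar x^T A x."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def outer(x, y):
    """Outer product x y^T as a list of rows."""
    return [[xi * yj for yj in y] for xi in x]


def matmul(A, B):
    """Matrix product A B for lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def trace(M):
    """Sum of the diagonal entries of a square matrix."""
    return sum(M[i][i] for i in range(len(M)))


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    A = [[2.0, 0.0, 1.0], [1.0, 3.0, 0.0], [0.0, 1.0, 4.0]]
    xxT = outer(x, x)
    # The three expressions in Equation 4.11 should coincide.
    print(quad_form(x, A), trace(matmul(xxT, A)), trace(matmul(A, xxT)))
```

Note that the identity holds even when $\mathbf{A}$ is not symmetric, as in the example matrix above.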


4.1.4 Maximum entropy derivation of the Gaussian *

In this section, we show that the multivariate Gaussian is the distribution with maximum entropy
subject to having a specified mean and covariance (see also Section 9.2.6). This is one reason the
Gaussian is so widely used: the first two moments are usually all that we can reliably estimate
from data, so we want a distribution that captures these properties, but otherwise makes as few
additional assumptions as possible.

To simplify notation, we will assume the mean is zero. The pdf has the form

$$p(\mathbf{x}) = \frac{1}{Z}\exp\left(-\frac{1}{2}\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x}\right)$$ (4.24)

If we define $f_{ij}(\mathbf{x}) = x_i x_j$ and $\lambda_{ij} = \frac{1}{2}(\boldsymbol{\Sigma}^{-1})_{ij}$, for $i,j \in \{1,\ldots,D\}$, we see that this is in
the same form as Equation 9.74. The (differential) entropy of this distribution (using log base $e$)
is given by

$$h(\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})) = \frac{1}{2}\ln\left[(2\pi e)^D|\boldsymbol{\Sigma}|\right]$$ (4.25)

We now show the MVN has maximum entropy amongst all distributions with a specified
covariance $\boldsymbol{\Sigma}$.

Theorem 4.1.2. Let $q(\mathbf{x})$ be any density satisfying $\int q(\mathbf{x})\,x_i x_j\,d\mathbf{x} = \Sigma_{ij}$. Let $p = \mathcal{N}(\mathbf{0},\boldsymbol{\Sigma})$. Then
$h(q) \le h(p)$.

Proof. (From (Cover and Thomas 1991, p234).) We have

$$0 \le \mathbb{KL}(q\|p) = \int q(\mathbf{x})\log\frac{q(\mathbf{x})}{p(\mathbf{x})}\,d\mathbf{x}$$ (4.26)

$$= -h(q) - \int q(\mathbf{x})\log p(\mathbf{x})\,d\mathbf{x}$$ (4.27)

$$=^{*} -h(q) - \int p(\mathbf{x})\log p(\mathbf{x})\,d\mathbf{x}$$ (4.28)

$$= -h(q) + h(p)$$ (4.29)

where the key step in Equation 4.28 (marked with a *) follows since $q$ and $p$ yield the same
moments for the quadratic form encoded by $\log p(\mathbf{x})$.
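
As a small illustration of Theorem 4.1.2 in one dimension, the sketch below compares the differential entropy of a zero-mean Gaussian (Equation 4.25 with $D = 1$) with that of two other zero-mean densities matched to the same variance: a uniform and a Laplace distribution. The closed-form entropies of the uniform ($\ln w$ for width $w$, variance $w^2/12$) and Laplace ($1 + \ln 2b$ for scale $b$, variance $2b^2$) are standard results that are not stated in this section; the function names are illustrative.

```python
import math


def entropy_gaussian(sigma2):
    """Differential entropy (nats) of N(0, sigma2): Equation 4.25 with D = 1."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma2)


def entropy_uniform(sigma2):
    """Entropy of a zero-mean uniform with variance sigma2 (width w, var w^2/12)."""
    width = math.sqrt(12.0 * sigma2)
    return math.log(width)


def entropy_laplace(sigma2):
    """Entropy of a zero-mean Laplace with variance sigma2 (scale b, var 2 b^2)."""
    scale = math.sqrt(sigma2 / 2.0)
    return 1.0 + math.log(2.0 * scale)


if __name__ == "__main__":
    for sigma2 in (0.1, 1.0, 7.0):
        print(sigma2, entropy_gaussian(sigma2),
              entropy_laplace(sigma2), entropy_uniform(sigma2))
```

For every variance tried, the Gaussian entropy should be the largest of the three, consistent with the theorem.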


4.2 Gaussian discriminant analysis


One important application of MVNs is to define the class conditional densities in a generative
classifier, i.e.,

$$p(\mathbf{x}|y=c,\boldsymbol{\theta}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_c,\boldsymbol{\Sigma}_c)$$ (4.30)

The resulting technique is called (Gaussian) discriminant analysis or GDA (even though it is a
generative, not discriminative, classifier; see Section 8.6 for more on this distinction). If $\boldsymbol{\Sigma}_c$ is
diagonal, this is equivalent to naive Bayes.

Figure 4.2 (a) Height/weight data. (b) Visualization of 2d Gaussians fit to each class. 95% of the probability
mass is inside the ellipse. Figure generated by gaussHeightWeight.

We can classify a feature vector using the following decision rule, derived from Equation 2.13:

$$\hat{y}(\mathbf{x}) = \operatorname*{argmax}_c \left[\log p(y=c|\boldsymbol{\pi}) + \log p(\mathbf{x}|\boldsymbol{\theta}_c)\right]$$ (4.31)

When we compute the probability of $\mathbf{x}$ under each class conditional density, we are measuring
the distance from $\mathbf{x}$ to the center of each class, $\boldsymbol{\mu}_c$, using Mahalanobis distance. This can be
thought of as a nearest centroids classifier.

As an example, Figure 4.2 shows two Gaussian class-conditional densities in 2d, representing
the height and weight of men and women. We can see that the features are correlated, as is
to be expected (tall people tend to weigh more). The ellipses for each class contain 95% of the
probability mass. If we have a uniform prior over classes, we can classify a new test vector as
follows:

$$\hat{y}(\mathbf{x}) = \operatorname*{argmin}_c\,(\mathbf{x}-\boldsymbol{\mu}_c)^T\boldsymbol{\Sigma}_c^{-1}(\mathbf{x}-\boldsymbol{\mu}_c)$$ (4.32)
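
The decision rule of Equation 4.31 can be sketched for 2d features as follows. This is a minimal illustration, not the book's implementation: the function names are hypothetical, the parameters are assumed to be given (fitting is discussed later), and the 2x2 inverse is written out by hand. The score includes the log-determinant term of the Gaussian density, which matters whenever the class covariances differ.

```python
import math


def log_gauss_2d(x, mu, cov):
    """Log density of a 2d Gaussian N(x | mu, cov), with cov = [[a, b], [b, d]]."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T cov^{-1} (x - mu) via the explicit 2x2 inverse.
    maha = (d * dx0 * dx0 - 2.0 * b * dx0 * dx1 + a * dx1 * dx1) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * maha


def gda_predict(x, priors, means, covs):
    """Index c maximizing log prior + log class-conditional density (Eq. 4.31)."""
    scores = [math.log(p) + log_gauss_2d(x, m, s)
              for p, m, s in zip(priors, means, covs)]
    return max(range(len(scores)), key=lambda c: scores[c])
```

For example, with two classes centered at (0, 0) and (4, 4), identity covariances and equal priors, a point is assigned to the nearer mean; a point exactly half way between the means is assigned to whichever class has the larger prior.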


4.2.1 Quadratic discriminant analysis (QDA)

The posterior over class labels is given by Equation 2.13. We can gain further insight into this
model by plugging in the definition of the Gaussian density, as follows:

$$p(y=c|\mathbf{x},\boldsymbol{\theta}) = \frac{\pi_c\,|2\pi\boldsymbol{\Sigma}_c|^{-\frac{1}{2}}\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_c)^T\boldsymbol{\Sigma}_c^{-1}(\mathbf{x}-\boldsymbol{\mu}_c)\right]}{\sum_{c'}\pi_{c'}\,|2\pi\boldsymbol{\Sigma}_{c'}|^{-\frac{1}{2}}\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_{c'})^T\boldsymbol{\Sigma}_{c'}^{-1}(\mathbf{x}-\boldsymbol{\mu}_{c'})\right]}$$ (4.33)

Thresholding this results in a quadratic function of $\mathbf{x}$. The result is known as quadratic
discriminant analysis (QDA). Figure 4.3 gives some examples of what the decision boundaries
look like in 2D.

Figure 4.3 Quadratic decision boundaries in 2D for the 2 and 3 class case. (One panel of the original figure is titled "Some Linear, Some Quadratic".) Figure generated by
discrimAnalysisDboundariesDemo.

Figure 4.4 Softmax distribution $S(\boldsymbol{\eta}/T)$, where $\boldsymbol{\eta} = (3,0,1)$, at different temperatures $T$ (panels: T=100, T=1, T=0.1, T=0.01). When the
temperature is high (left), the distribution is uniform, whereas when the temperature is low (right), the
distribution is "spiky", with all its mass on the largest element. Figure generated by softmaxDemo2.


4.2.2 Linear discriminant analysis (LDA) 


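We now consider a special case in which the covariance matrices are tied or shared across
classes, $\boldsymbol{\Sigma}_c = \boldsymbol{\Sigma}$. In this case, we can simplify Equation 4.33 as follows:

$$p(y=c|\mathbf{x},\boldsymbol{\theta}) \propto \pi_c\exp\left[\boldsymbol{\mu}_c^T\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_c^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c\right]$$ (4.34)

$$= \exp\left[\boldsymbol{\mu}_c^T\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_c^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c + \log\pi_c\right]\exp\left[-\frac{1}{2}\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x}\right]$$ (4.35)

Since the quadratic term $\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x}$ is independent of $c$, it will cancel out in the numerator and
denominator. If we define

$$\gamma_c = -\frac{1}{2}\boldsymbol{\mu}_c^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c + \log\pi_c$$ (4.36)

$$\boldsymbol{\beta}_c = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_c$$ (4.37)

then we can write

$$p(y=c|\mathbf{x},\boldsymbol{\theta}) = \frac{e^{\boldsymbol{\beta}_c^T\mathbf{x}+\gamma_c}}{\sum_{c'}e^{\boldsymbol{\beta}_{c'}^T\mathbf{x}+\gamma_{c'}}} = S(\boldsymbol{\eta})_c$$ (4.38)

where $\boldsymbol{\eta} = [\boldsymbol{\beta}_1^T\mathbf{x}+\gamma_1,\ldots,\boldsymbol{\beta}_C^T\mathbf{x}+\gamma_C]$, and $S$ is the softmax function, defined as follows:

$$S(\boldsymbol{\eta})_c = \frac{e^{\eta_c}}{\sum_{c'=1}^{C}e^{\eta_{c'}}}$$ (4.39)

The softmax function is so-called since it acts a bit like the max function. To see this, let us
divide each $\eta_c$ by a constant $T$ called the temperature. Then as $T \to 0$, we find

$$S(\boldsymbol{\eta}/T)_c = \begin{cases} 1.0 & \text{if } c = \operatorname{argmax}_{c'}\eta_{c'} \\ 0.0 & \text{otherwise} \end{cases}$$ (4.40)

In other words, at low temperatures, the distribution spends essentially all of its time in the
most probable state, whereas at high temperatures, it visits all states uniformly. See Figure 4.4
for an illustration. Note that this terminology comes from the area of statistical physics, where
it is common to use the Boltzmann distribution, which has the same form as the softmax
function.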
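
The effect of the temperature can be reproduced with a few lines of Python. This is a minimal sketch; the function name and the max-subtraction step (a common numerical-stability device that leaves the result unchanged) are choices made here, not taken from softmaxDemo2.

```python
import math


def softmax(eta, temperature=1.0):
    """Softmax S(eta / T) of Equation 4.39, computed stably."""
    scaled = [e / temperature for e in eta]
    m = max(scaled)  # subtracting the max does not change the ratios
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [v / total for v in exps]


if __name__ == "__main__":
    eta = (3.0, 0.0, 1.0)  # the vector used in Figure 4.4
    for t in (100.0, 1.0, 0.1, 0.01):
        print(t, softmax(eta, t))
```

At $T = 100$ the output should be close to uniform, and at $T = 0.01$ essentially all the mass should sit on the first (largest) element, matching the behavior described for Figure 4.4.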

An interesting property of Equation 4.38 is that, if we take logs, we end up with a linear
function of $\mathbf{x}$. (It is linear because the $\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x}$ term cancels from the numerator
and denominator.) Thus the decision boundary between any two classes, say $c$ and $c'$, will be
a straight line. Hence this technique is called linear discriminant analysis or LDA.¹ We can
derive the form of this line as follows:

$$p(y=c|\mathbf{x},\boldsymbol{\theta}) = p(y=c'|\mathbf{x},\boldsymbol{\theta})$$ (4.41)

$$\boldsymbol{\beta}_c^T\mathbf{x}+\gamma_c = \boldsymbol{\beta}_{c'}^T\mathbf{x}+\gamma_{c'}$$ (4.42)

$$\mathbf{x}^T(\boldsymbol{\beta}_{c'}-\boldsymbol{\beta}_c) = \gamma_c-\gamma_{c'}$$ (4.43)

See Figure 4.5 for some examples.

An alternative to fitting an LDA model and then deriving the class posterior is to directly
fit $p(y|\mathbf{x},\mathbf{W}) = \mathrm{Cat}(y|\mathbf{W}\mathbf{x})$ for some $C \times D$ weight matrix $\mathbf{W}$. This is called multi-class
logistic regression, or multinomial logistic regression.² We will discuss this model in detail
in Section 8.2. The difference between the two approaches is explained in Section 8.6.


4.2.3 Two-class LDA

To gain further insight into the meaning of these equations, let us consider the binary case. In
this case, the posterior is given by

$$p(y=1|\mathbf{x},\boldsymbol{\theta}) = \frac{e^{\boldsymbol{\beta}_1^T\mathbf{x}+\gamma_1}}{e^{\boldsymbol{\beta}_1^T\mathbf{x}+\gamma_1}+e^{\boldsymbol{\beta}_0^T\mathbf{x}+\gamma_0}}$$ (4.44)

$$= \frac{1}{1+e^{(\boldsymbol{\beta}_0-\boldsymbol{\beta}_1)^T\mathbf{x}+(\gamma_0-\gamma_1)}} = \mathrm{sigm}\left((\boldsymbol{\beta}_1-\boldsymbol{\beta}_0)^T\mathbf{x}+(\gamma_1-\gamma_0)\right)$$ (4.45)

1. The abbreviation "LDA" could either stand for "linear discriminant analysis" or "latent Dirichlet allocation" (Section 27.3). We hope the meaning is clear from context.

2. In the language modeling community, this model is called a maximum entropy model, for reasons explained in
Section 9.2.6.

Figure 4.5 Linear decision boundaries in 2D for the 2 and 3 class case (panels: "Linear Boundary", "All Linear Boundaries"). Figure generated by
discrimAnalysisDboundariesDemo.

Figure 4.6 Geometry of LDA in the 2 class case where $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \mathbf{I}$.


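where $\mathrm{sigm}(\eta)$ refers to the sigmoid function (Equation 1.10).

Now

$$\gamma_1-\gamma_0 = -\frac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_0^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_0 + \log(\pi_1/\pi_0)$$ (4.46)

$$= -\frac{1}{2}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_0)^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1+\boldsymbol{\mu}_0) + \log(\pi_1/\pi_0)$$ (4.47)

So if we define

$$\mathbf{w} = \boldsymbol{\beta}_1-\boldsymbol{\beta}_0 = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_0)$$ (4.48)

$$\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_1+\boldsymbol{\mu}_0) - (\boldsymbol{\mu}_1-\boldsymbol{\mu}_0)\frac{\log(\pi_1/\pi_0)}{(\boldsymbol{\mu}_1-\boldsymbol{\mu}_0)^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_0)}$$ (4.49)

then we have $\mathbf{w}^T\mathbf{x}_0 = -(\gamma_1-\gamma_0)$, and hence

$$p(y=1|\mathbf{x},\boldsymbol{\theta}) = \mathrm{sigm}(\mathbf{w}^T(\mathbf{x}-\mathbf{x}_0))$$ (4.50)

(This is closely related to logistic regression, which we will discuss in Section 8.2.) So the final
decision rule is as follows: shift $\mathbf{x}$ by $\mathbf{x}_0$, project onto the line $\mathbf{w}$, and see if the result is positive
or negative.

If $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$, then $\mathbf{w}$ is in the direction of $\boldsymbol{\mu}_1-\boldsymbol{\mu}_0$. So we classify the point based on whether
its projection is closer to $\boldsymbol{\mu}_0$ or $\boldsymbol{\mu}_1$. This is illustrated in Figure 4.6. Furthermore, if $\pi_1 = \pi_0$, then
$\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_1+\boldsymbol{\mu}_0)$, which is half way between the means. If we make $\pi_1 > \pi_0$, then $\mathbf{x}_0$ gets
closer to $\boldsymbol{\mu}_0$, so more of the line belongs to class 1 a priori. Conversely if $\pi_1 < \pi_0$, the boundary
shifts right. Thus we see that the class prior, $\pi_c$, just changes the decision threshold, and not
the overall geometry, as we claimed above. (A similar argument applies in the multi-class case.)

The magnitude of $\mathbf{w}$ determines the steepness of the logistic function, and depends on
how well-separated the means are, relative to the variance. In psychology and signal detection
theory, it is common to define the discriminability of a signal from the background noise using
a quantity called d-prime:

$$d' \triangleq \frac{\mu_1-\mu_0}{\sigma}$$ (4.51)

where $\mu_1$ is the mean of the signal, $\mu_0$ is the mean of the noise, and $\sigma$ is the standard
deviation of the noise. If $d'$ is large, the signal will be easier to discriminate from the noise.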
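
The two-class construction can be checked numerically. The sketch below, written for 2d features with hypothetical function names, computes $\mathbf{w}$ and $\mathbf{x}_0$ from Equations 4.48 and 4.49 and compares the sigmoid form of Equation 4.50 with the posterior computed directly from $\boldsymbol{\beta}_c$ and $\gamma_c$ (Equations 4.36, 4.37 and 4.45).

```python
import math


def solve_2x2(cov, v):
    """Return cov^{-1} v for a 2x2 matrix cov = [[a, b], [c, d]]."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    return [(d * v[0] - b * v[1]) / det, (-c * v[0] + a * v[1]) / det]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))


def two_class_lda(mu0, mu1, cov, pi0, pi1):
    """Return (w, x0) from Equations 4.48 and 4.49."""
    diff = [m1 - m0 for m0, m1 in zip(mu0, mu1)]
    w = solve_2x2(cov, diff)
    # log prior odds divided by (mu1 - mu0)^T Sigma^{-1} (mu1 - mu0)
    scale = math.log(pi1 / pi0) / dot(diff, w)
    x0 = [0.5 * (m1 + m0) - dfi * scale for m0, m1, dfi in zip(mu0, mu1, diff)]
    return w, x0


def posterior_class1(x, mu0, mu1, cov, pi0, pi1):
    """p(y = 1 | x) via Equation 4.50."""
    w, x0 = two_class_lda(mu0, mu1, cov, pi0, pi1)
    return sigm(dot(w, [xi - x0i for xi, x0i in zip(x, x0)]))


def posterior_class1_direct(x, mu0, mu1, cov, pi0, pi1):
    """p(y = 1 | x) via Equation 4.45, using beta_c and gamma_c."""
    b0, b1 = solve_2x2(cov, mu0), solve_2x2(cov, mu1)
    g0 = -0.5 * dot(mu0, b0) + math.log(pi0)
    g1 = -0.5 * dot(mu1, b1) + math.log(pi1)
    return sigm((dot(b1, x) + g1) - (dot(b0, x) + g0))
```

With equal priors, `two_class_lda` should return the midpoint of the two means as $\mathbf{x}_0$; increasing $\pi_1$ should move $\mathbf{x}_0$ towards $\boldsymbol{\mu}_0$, as described above.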


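4.2.4 MLE for discriminant analysis

We now discuss how to fit a discriminant analysis model. The simplest way is to use maximum
likelihood. The log-likelihood function is as follows:

$$\log p(\mathcal{D}|\boldsymbol{\theta}) = \left[\sum_{i=1}^{N}\sum_{c=1}^{C}\mathbb{I}(y_i=c)\log\pi_c\right] + \sum_{c=1}^{C}\left[\sum_{i:y_i=c}\log\mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_c,\boldsymbol{\Sigma}_c)\right]$$ (4.52)

We see that this factorizes into a term for $\boldsymbol{\pi}$, and $C$ terms for each $\boldsymbol{\mu}_c$ and $\boldsymbol{\Sigma}_c$. Hence we
can estimate these parameters separately. For the class prior, we have $\hat{\pi}_c = \frac{N_c}{N}$, as with naive
Bayes. For the class-conditional densities, we just partition the data based on its class label, and
compute the MLE for each Gaussian:

$$\hat{\boldsymbol{\mu}}_c = \frac{1}{N_c}\sum_{i:y_i=c}\mathbf{x}_i, \quad \hat{\boldsymbol{\Sigma}}_c = \frac{1}{N_c}\sum_{i:y_i=c}(\mathbf{x}_i-\hat{\boldsymbol{\mu}}_c)(\mathbf{x}_i-\hat{\boldsymbol{\mu}}_c)^T$$ (4.53)

See discrimAnalysisFit for a Matlab implementation. Once the model has been fit, you can
make predictions using discrimAnalysisPredict, which uses a plug-in approximation.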
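
For readers working in Python, the fitting step of Equation 4.53 can be sketched as below. This is an illustrative re-statement of the formulas, not a port of the Matlab discrimAnalysisFit function; the name `discrim_analysis_fit` and the returned dictionary layout are assumptions made here.

```python
from collections import defaultdict


def discrim_analysis_fit(X, y):
    """Per-class MLE. Returns {c: (prior, mean, cov)} following Equation 4.53."""
    groups = defaultdict(list)
    for xi, yi in zip(X, y):
        groups[yi].append(xi)  # partition the data by class label
    n = len(X)
    params = {}
    for c, rows in groups.items():
        nc = len(rows)
        dim = len(rows[0])
        mean = [sum(r[j] for r in rows) / nc for j in range(dim)]
        cov = [[sum((r[j] - mean[j]) * (r[k] - mean[k]) for r in rows) / nc
                for k in range(dim)]
               for j in range(dim)]
        params[c] = (nc / n, mean, cov)  # prior is N_c / N
    return params
```

Note that a class with very few examples can yield a singular covariance estimate: with two 2d points in a class, for instance, the estimated covariance has rank one.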


4.2.5 Strategies for preventing overfitting

The speed and simplicity of the MLE method are among its greatest appeals. However, the MLE
can badly overfit in high dimensions. In particular, the MLE for a full covariance matrix is
singular if $N_c < D$. And even when $N_c > D$, the MLE can be ill-conditioned, meaning it is
close to singular. There are several possible solutions to this problem:


• Use a diagonal covariance matrix for each class, which assumes the features are conditionally
independent; this is equivalent to using a naive Bayes classifier (Section 3.5).

• Use a full covariance matrix, but force it to be the same for all classes, $\boldsymbol{\Sigma}_c = \boldsymbol{\Sigma}$. This is an
example of parameter tying or parameter sharing, and is equivalent to LDA (Section 4.2.2).

• Use a diagonal covariance matrix and force it to be shared. This is called diagonal covariance
LDA, and is discussed in Section 4.2.7.

• Use a full covariance matrix, but impose a prior and then integrate it out. If we use a
conjugate prior, this can be done in closed form, using the results from Section 4.6.3; this
is analogous to the "Bayesian naive Bayes" method in Section 3.5.1.2. See (Minka 2000f) for
details.

• Fit a full or diagonal covariance matrix by MAP estimation. We discuss two different kinds
of prior below.

• Project the data into a low dimensional subspace and fit the Gaussians there. See Section 8.6.3.3
for a way to find the best (most discriminative) linear projection.

We discuss some of these options below.


4.2.6 Regularized LDA *


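Suppose we tie the covariance matrices, so $\boldsymbol{\Sigma}_c = \boldsymbol{\Sigma}$, as in LDA, and furthermore we perform
MAP estimation of $\boldsymbol{\Sigma}$ using an inverse Wishart prior of the form $\mathrm{IW}(\mathrm{diag}(\hat{\boldsymbol{\Sigma}}_{mle}), \nu_0)$ (see
Section 4.5.1). Then we have

$$\hat{\boldsymbol{\Sigma}} = \lambda\,\mathrm{diag}(\hat{\boldsymbol{\Sigma}}_{mle}) + (1-\lambda)\hat{\boldsymbol{\Sigma}}_{mle}$$ (4.54)

where $\lambda$ controls the amount of regularization, which is related to the strength of the prior, $\nu_0$
(see Section 4.6.2.1 for details). This technique is known as regularized discriminant analysis
or RDA (Hastie et al. 2009, p656).

When we evaluate the class conditional densities, we need to compute $\hat{\boldsymbol{\Sigma}}^{-1}$, and hence $\hat{\boldsymbol{\Sigma}}_{mle}^{-1}$,
which is impossible to compute if $D > N$. However, we can use the SVD of $\mathbf{X}$ (Section 12.2.3)
to get around this, as we show below. (Note that this trick cannot be applied to QDA, which is
a nonlinear function of $\mathbf{x}$.)

Let $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$ be the SVD of the design matrix, where $\mathbf{V}$ is $D \times N$, $\mathbf{U}$ is an $N \times N$
orthogonal matrix, and $\mathbf{D}$ is a diagonal matrix of size $N$. Furthermore, define the $N \times N$
matrix $\mathbf{Z} = \mathbf{U}\mathbf{D}$; this is like a design matrix in a lower dimensional space (since we assume
$N < D$). Also, define $\boldsymbol{\mu}_z = \mathbf{V}^T\boldsymbol{\mu}$ as the mean of the data in this reduced space; we can recover
the original mean using $\boldsymbol{\mu} = \mathbf{V}\boldsymbol{\mu}_z$, since $\mathbf{V}^T\mathbf{V} = \mathbf{I}$ and $\boldsymbol{\mu}$ lies in the span of the columns of $\mathbf{V}$
(so that $\mathbf{V}\mathbf{V}^T\boldsymbol{\mu} = \boldsymbol{\mu}$). With these definitions, we can
rewrite the MLE as follows:

$$\hat{\boldsymbol{\Sigma}}_{mle} = \frac{1}{N}\mathbf{X}^T\mathbf{X} - \boldsymbol{\mu}\boldsymbol{\mu}^T$$ (4.55)

$$= \frac{1}{N}(\mathbf{Z}\mathbf{V}^T)^T(\mathbf{Z}\mathbf{V}^T) - (\mathbf{V}\boldsymbol{\mu}_z)(\mathbf{V}\boldsymbol{\mu}_z)^T$$ (4.56)

$$= \frac{1}{N}\mathbf{V}\mathbf{Z}^T\mathbf{Z}\mathbf{V}^T - \mathbf{V}\boldsymbol{\mu}_z\boldsymbol{\mu}_z^T\mathbf{V}^T$$ (4.57)

$$= \mathbf{V}\left(\frac{1}{N}\mathbf{Z}^T\mathbf{Z} - \boldsymbol{\mu}_z\boldsymbol{\mu}_z^T\right)\mathbf{V}^T$$ (4.58)

$$= \mathbf{V}\hat{\boldsymbol{\Sigma}}_z\mathbf{V}^T$$ (4.59)

where $\hat{\boldsymbol{\Sigma}}_z$ is the empirical covariance of $\mathbf{Z}$. Hence we can rewrite the MAP estimate as

$$\hat{\boldsymbol{\Sigma}}_{map} = \mathbf{V}\tilde{\boldsymbol{\Sigma}}_z\mathbf{V}^T$$ (4.60)

$$\tilde{\boldsymbol{\Sigma}}_z = \lambda\,\mathrm{diag}(\hat{\boldsymbol{\Sigma}}_z) + (1-\lambda)\hat{\boldsymbol{\Sigma}}_z$$ (4.61)

Note, however, that we never need to actually compute the $D \times D$ matrix $\hat{\boldsymbol{\Sigma}}_{map}$. This is because
Equation 4.38 tells us that to classify using LDA, all we need to compute is $p(y=c|\mathbf{x},\boldsymbol{\theta}) \propto \exp(\delta_c)$, where

$$\delta_c = \mathbf{x}^T\boldsymbol{\beta}_c + \gamma_c, \quad \boldsymbol{\beta}_c = \hat{\boldsymbol{\Sigma}}^{-1}\boldsymbol{\mu}_c, \quad \gamma_c = -\frac{1}{2}\boldsymbol{\mu}_c^T\boldsymbol{\beta}_c + \log\pi_c$$ (4.62)

We can compute the crucial $\boldsymbol{\beta}_c$ term for RDA without inverting the $D \times D$ matrix as follows:

$$\boldsymbol{\beta}_c = \hat{\boldsymbol{\Sigma}}_{map}^{-1}\boldsymbol{\mu}_c = (\mathbf{V}\tilde{\boldsymbol{\Sigma}}_z\mathbf{V}^T)^{-1}\boldsymbol{\mu}_c = \mathbf{V}\tilde{\boldsymbol{\Sigma}}_z^{-1}\mathbf{V}^T\boldsymbol{\mu}_c = \mathbf{V}\tilde{\boldsymbol{\Sigma}}_z^{-1}\boldsymbol{\mu}_{z,c}$$ (4.63)

where $\boldsymbol{\mu}_{z,c} = \mathbf{V}^T\boldsymbol{\mu}_c$ is the mean of the $\mathbf{Z}$ matrix for data belonging to class $c$. See rdaFit for
the code.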
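
The shrinkage step of Equation 4.54 is simple enough to sketch directly. The code below is a minimal illustration with hypothetical function names; it covers only the regularization of a given covariance estimate, not the SVD-based computation implemented in rdaFit. Since the diagonal is left unchanged and each off-diagonal entry is scaled by $(1-\lambda)$, a singular estimate can become invertible for $\lambda > 0$.

```python
def shrink_covariance(cov_mle, lam):
    """Return lam * diag(cov_mle) + (1 - lam) * cov_mle (Equation 4.54).

    lam is assumed to lie in [0, 1]; lam = 1 gives a purely diagonal matrix.
    """
    dim = len(cov_mle)
    return [[cov_mle[j][k] if j == k else (1.0 - lam) * cov_mle[j][k]
             for k in range(dim)]
            for j in range(dim)]


def det_2x2(m):
    """Determinant of a 2x2 matrix, used here only to check invertibility."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


if __name__ == "__main__":
    singular = [[1.0, 1.0], [1.0, 1.0]]  # rank-one covariance estimate
    for lam in (0.0, 0.5, 1.0):
        shrunk = shrink_covariance(singular, lam)
        print(lam, shrunk, det_2x2(shrunk))
```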


Diagonal LDA 


A simple alternative to RDA is to tie the covariance matrices, so $\Sigma_c = \Sigma$ as in LDA, and then to
use a diagonal covariance matrix for each class. This is called the diagonal LDA model, and is
equivalent to RDA with $\lambda = 1$. The corresponding discriminant function is as follows (compare
to Equation 4.33):

$$
\delta_c(\mathbf{x}) = \log p(\mathbf{x}, y = c|\theta) = -\sum_{j=1}^{D} \frac{(x_j - \mu_{cj})^2}{2\sigma_j^2} + \log\pi_c \qquad (4.64)
$$

Typically we set $\hat{\mu}_{cj} = \bar{x}_{cj}$ and $\hat{\sigma}_j^2 = s_j^2$, which is the pooled empirical variance of feature j
(pooled across classes) defined by

$$
s_j^2 = \frac{\sum_{c=1}^{C}\sum_{i: y_i = c}(x_{ij} - \bar{x}_{cj})^2}{N - C} \qquad (4.65)
$$
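
As a minimal sketch of Equations 4.64 and 4.65, the following NumPy code fits the class means, the pooled per-feature variances and the class priors, and then classifies by the largest discriminant. The function names `diag_lda_fit` and `diag_lda_predict` are assumptions for this example, not PMTK functions.

```python
import numpy as np

def diag_lda_fit(X, y):
    """Estimate class means, pooled per-feature variances (Eq. 4.65) and class priors."""
    classes = np.unique(y)
    N, C = X.shape[0], len(classes)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([(y == c).mean() for c in classes])
    # Sum of squared deviations from each class's own mean, pooled over classes.
    ss = sum(((X[y == c] - means[k]) ** 2).sum(axis=0) for k, c in enumerate(classes))
    pooled_var = ss / (N - C)
    return classes, means, pooled_var, priors

def diag_lda_predict(X, classes, means, pooled_var, priors):
    """Pick the class with the largest discriminant delta_c(x) from Eq. 4.64."""
    diff = X[:, None, :] - means[None, :, :]                 # shape (N, C, D)
    delta = -(diff ** 2 / (2 * pooled_var)).sum(axis=2) + np.log(priors)
    return classes[delta.argmax(axis=1)]
```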


In high dimensional settings, this model can work much better than LDA and RDA (Bickel and
Levina 2004).

Figure 4.7 Error versus amount of shrinkage for nearest shrunken centroid classifier applied to the
SRBCT gene expression data. Based on Figure 18.4 of (Hastie et al. 2009). Figure generated by
shrunkenCentroidsSRBCTdemo.


Nearest shrunken centroids classifier * 


One drawback of diagonal LDA is that it depends on all of the features. In high dimensional 
problems, we might prefer a method that only depends on a subset of the features, for reasons 
of accuracy and interpretability. One approach is to use a screening method, perhaps based 
on mutual information, as in Section 3.5.4. We now discuss another approach to this problem 
known as the nearest shrunken centroids classifier (Hastie et al. 2009, p652). 

The basic idea is to perform MAP estimation for diagonal LDA with a sparsity-promoting
(Laplace) prior (see Section 13.3). More precisely, define the class-specific feature mean, $\mu_{cj}$, in
terms of the class-independent feature mean, $m_j$, and a class-specific offset, $\Delta_{cj}$. Thus we have

$$
\mu_{cj} = m_j + \Delta_{cj} \qquad (4.66)
$$

We will then put a prior on the $\Delta_{cj}$ terms to encourage them to be strictly zero and compute
a MAP estimate. If, for feature j, we find that $\Delta_{cj} = 0$ for all c, then feature j will play no role
in the classification decision (since $\mu_{cj}$ will be independent of c). Thus features that are not
discriminative are automatically ignored. The details can be found in (Hastie et al. 2009, p652)
and (Greenshtein and Park 2009). See shrunkenCentroidsFit for some code.
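
The shrinkage idea can be illustrated with a deliberately simplified sketch. It soft-thresholds the raw offsets $\bar{x}_{cj} - \bar{x}_j$ directly, which is the usual effect of a Laplace prior under a Gaussian likelihood; it omits the per-class standardization used in the method of Hastie et al., so it should not be read as that algorithm or as shrunkenCentroidsFit. The names `soft_threshold` and `shrunken_centroids` are assumptions for this example.

```python
import numpy as np

def soft_threshold(a, lam):
    """Shrink each entry of a towards zero by lam, clipping at zero."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def shrunken_centroids(X, y, lam):
    """Simplified illustration: mu_cj = m_j + Delta_cj with soft-thresholded offsets.

    Returns the shrunken class centroids and a boolean mask of the features
    that still have a nonzero offset for at least one class.
    """
    classes = np.unique(y)
    m = X.mean(axis=0)                                            # class-independent mean m_j
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    offsets = soft_threshold(centroids - m, lam)                  # Delta_cj
    active = np.any(offsets != 0, axis=0)
    return m + offsets, active
```

Features whose offsets are all shrunk to zero drop out of the discriminant, which is the source of the sparsity discussed above.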

Let us give an example of the method in action, based on (Hastie et al. 2009, p652). Consider
the problem of classifying a gene expression dataset, which has 2308 genes, 4 classes, 63 training
samples and 20 test samples. Using a diagonal LDA classifier produces 5 errors on the test set.
Using the nearest shrunken centroids classifier produces 0 errors on the test set, for a range of
$\lambda$ values: see Figure 4.7. More importantly, the model is sparse and hence more interpretable:
Figure 4.8 plots an unpenalized estimate of the difference, $d_{cj}$, in gray, as well as the shrunken
estimates $\Delta_{cj}$ in blue. (These estimates are computed using the value of $\lambda$ estimated by CV.)
We see that only 39 genes are used, out of the original 2308.

Now consider an even harder problem, with 16,603 genes, a training set of 144 patients, a
test set of 54 patients, and 14 different types of cancer (Ramaswamy et al. 2001). Hastie et al.
(2009, p656) report that nearest shrunken centroids produced 17 errors on the test
set, using 6,520 genes, and that RDA (Section 4.2.6) produced 12 errors on the test set, using
all 16,603 genes. The PMTK function cancerHighDimClassifDemo can be used to reproduce
these numbers.

Figure 4.8 Profile of the shrunken centroids corresponding to $\lambda = 4.4$ (CV optimal in Figure 4.7).
This selects 39 genes. Based on Figure 18.4 of (Hastie et al. 2009). Figure generated by
shrunkenCentroidsSRBCTdemo.


Inference in jointly Gaussian distributions 


Given a joint distribution, $p(\mathbf{x}_1, \mathbf{x}_2)$, it is useful to be able to compute marginals $p(\mathbf{x}_1)$ and
conditionals $p(\mathbf{x}_1|\mathbf{x}_2)$. We discuss how to do this below, and then give some applications. These
operations take $O(D^3)$ time in the worst case. See Section 20.4.3 for faster methods.


Statement of the result


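Theorem 4.3.1 (Marginals and conditionals of an MVN). Suppose $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)$ is jointly Gaussian
with parameters

$$
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \quad
\Lambda = \Sigma^{-1} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix} \qquad (4.67)
$$

Then the marginals are given by

$$
\begin{aligned}
p(\mathbf{x}_1) &= \mathcal{N}(\mathbf{x}_1|\mu_1, \Sigma_{11}) \\
p(\mathbf{x}_2) &= \mathcal{N}(\mathbf{x}_2|\mu_2, \Sigma_{22})
\end{aligned} \qquad (4.68)
$$

and the posterior conditional is given by

$$
\boxed{
\begin{aligned}
p(\mathbf{x}_1|\mathbf{x}_2) &= \mathcal{N}(\mathbf{x}_1|\mu_{1|2}, \Sigma_{1|2}) \\
\mu_{1|2} &= \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \mu_2) \\
&= \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}(\mathbf{x}_2 - \mu_2) \\
&= \Sigma_{1|2}\left(\Lambda_{11}\mu_1 - \Lambda_{12}(\mathbf{x}_2 - \mu_2)\right) \\
\Sigma_{1|2} &= \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \Lambda_{11}^{-1}
\end{aligned}
} \qquad (4.69)
$$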


Equation 4.69 is of such crucial importance in this book that we have put a box around it, so 
you can easily find it. For the proof, see Section 4.3.4. 

We see that both the marginal and conditional distributions are themselves Gaussian. For the 
marginals, we just extract the rows and columns corresponding to $\mathbf{x}_1$ or $\mathbf{x}_2$. For the conditional,
we have to do a bit more work. However, it is not that complicated: the conditional mean is 
just a linear function of x2, and the conditional covariance is just a constant matrix that is 
independent of x2. We give three different (but equivalent) expressions for the posterior mean, 
and two different (but equivalent) expressions for the posterior covariance; each one is useful in 
different circumstances. 
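
The conditioning formula in Equation 4.69 translates directly into code. The sketch below uses the covariance form of the mean and covariance; the function name `gauss_condition` and its index arguments are assumptions for this example.

```python
import numpy as np

def gauss_condition(mu, Sigma, idx1, idx2, x2):
    """Mean and covariance of p(x1 | x2) for a joint Gaussian N(mu, Sigma) (Eq. 4.69).

    idx1 and idx2 are lists of integer indices selecting x1 and x2.
    """
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    # K = Sigma_12 Sigma_22^{-1}, computed with a solve (Sigma_22 is symmetric).
    K = np.linalg.solve(S22, S12.T).T
    mu_cond = mu1 + K @ (x2 - mu2)
    Sigma_cond = S11 - K @ S12.T
    return mu_cond, Sigma_cond
```

For a 2d Gaussian with unit variances, zero mean and correlation 0.8, conditioning on $x_2 = 1$ with this function gives mean 0.8 and variance 0.36.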


Examples 

Below we give some examples of these equations in action, which will make them seem more 
intuitive. 

Marginals and conditionals of a 2d Gaussian 


Let us consider a 2d example. The covariance matrix is

$$
\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \qquad (4.70)
$$

The marginal $p(x_1)$ is a 1D Gaussian, obtained by projecting the joint distribution onto the $x_1$
line:

$$
p(x_1) = \mathcal{N}(x_1|\mu_1, \sigma_1^2) \qquad (4.71)
$$

Figure 4.9 (a) A joint Gaussian distribution $p(x_1, x_2)$ with a correlation coefficient of 0.8. We plot the
95% contour and the principal axes. (b) The unconditional marginal $p(x_1)$. (c) The conditional
$p(x_1|x_2) = \mathcal{N}(x_1|0.8, 0.36)$, obtained by slicing (a) at height $x_2 = 1$. Figure generated by
gaussCondition2Ddemo2.


Suppose we observe $X_2 = x_2$; the conditional $p(x_1|x_2)$ is obtained by “slicing” the joint
distribution through the $X_2 = x_2$ line (see Figure 4.9):

$$
p(x_1|x_2) = \mathcal{N}\left(x_1 \,\Big|\, \mu_1 + \frac{\rho\sigma_1\sigma_2}{\sigma_2^2}(x_2 - \mu_2),\; \sigma_1^2 - \frac{(\rho\sigma_1\sigma_2)^2}{\sigma_2^2}\right) \qquad (4.72)
$$

If $\sigma_1 = \sigma_2 = \sigma$, we get

$$
p(x_1|x_2) = \mathcal{N}\left(x_1|\mu_1 + \rho(x_2 - \mu_2),\; \sigma^2(1 - \rho^2)\right) \qquad (4.73)
$$

In Figure 4.9 we show an example where $\rho = 0.8$, $\sigma_1 = \sigma_2 = 1$, $\mu = 0$ and $x_2 = 1$. We
see that $E[x_1|x_2 = 1] = 0.8$, which makes sense, since $\rho = 0.8$ means that we believe that if
$x_2$ increases by 1 (beyond its mean), then $x_1$ increases by 0.8. We also see $\mathrm{var}[x_1|x_2 = 1] =
1 - 0.8^2 = 0.36$. This also makes sense: our uncertainty about $x_1$ has gone down, since we
have learned something about $x_1$ (indirectly) by observing $x_2$. If $\rho = 0$, we get $p(x_1|x_2) =
\mathcal{N}(x_1|\mu_1, \sigma_1^2)$, since $x_2$ conveys no information about $x_1$ if they are uncorrelated (and hence
independent).


Interpolating noise-free data 


Suppose we want to estimate a 1d function, defined on the interval [0, T], such that $y_i = f(t_i)$
for N observed points $t_i$. We assume for now that the data is noise-free, so we want to
interpolate it, that is, fit a function that goes exactly through the data. (See Section 4.4.2.3 for
the noisy data case.) The question is: how does the function behave in between the observed
data points? It is often reasonable to assume that the unknown function is smooth. In Chapter 15,
we shall see how to encode priors over functions, and how to update such a prior with observed
values to get a posterior over functions. But in this section, we take a simpler approach, which
is adequate for MAP estimation of functions defined on 1d inputs. We follow the presentation
of (Calvetti and Somersalo 2007, p135).

We start by discretizing the problem. First we divide the support of the function into D equal
subintervals. We then define

$$
x_j = f(s_j), \quad s_j = jh, \quad h = \frac{T}{D}, \quad 1 \le j \le D \qquad (4.74)
$$

Figure 4.10 Interpolating noise-free data using a Gaussian with prior precision $\lambda$. (a) $\lambda = 30$. (b)
$\lambda = 0.01$. See also Figure 4.15. Based on Figure 7.1 of (Calvetti and Somersalo 2007). Figure generated by
gaussInterpDemo.


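We can encode our smoothness prior by assuming that $x_j$ is an average of its neighbors, $x_{j-1}$
and $x_{j+1}$, plus some Gaussian noise:

$$
x_j = \frac{1}{2}(x_{j-1} + x_{j+1}) + \epsilon_j, \quad 2 \le j \le D - 1 \qquad (4.75)
$$

where $\epsilon \sim \mathcal{N}(0, (1/\lambda)\mathbf{I})$. The precision term $\lambda$ controls how much we think the function will
vary: a large $\lambda$ corresponds to a belief that the function is very smooth, a small $\lambda$ corresponds
to a belief that the function is quite “wiggly”. In vector form, the above equation can be written
as follows:

$$
\mathbf{L}\mathbf{x} = \epsilon \qquad (4.76)
$$

where $\mathbf{L}$ is the (D − 2) x D second order finite difference matrix

$$
\mathbf{L} = \frac{1}{2}\begin{pmatrix}
-1 & 2 & -1 & & & \\
 & -1 & 2 & -1 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & -1 & 2 & -1
\end{pmatrix} \qquad (4.77)
$$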


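The corresponding prior has the form

$$
p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\mathbf{0}, (\lambda^2\mathbf{L}^T\mathbf{L})^{-1}) \propto \exp\left(-\frac{\lambda^2}{2}\|\mathbf{L}\mathbf{x}\|_2^2\right) \qquad (4.78)
$$

We will henceforth assume we have scaled $\mathbf{L}$ by $\lambda$ so we can ignore the $\lambda$ term, and just write
$\Lambda = \mathbf{L}^T\mathbf{L}$ for the precision matrix.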

Note that although $\mathbf{x}$ is D-dimensional, the precision matrix $\Lambda$ only has rank D − 2. Thus
this is an improper prior, known as an intrinsic Gaussian random field (see Section 19.4.4 for
more information). However, providing we observe N ≥ 2 data points, the posterior will be
proper.

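Now let $\mathbf{x}_2$ be the N noise-free observations of the function, and $\mathbf{x}_1$ be the D − N unknown
function values. Without loss of generality, assume that the unknown variables are ordered first,
then the known variables. Then we can partition the $\mathbf{L}$ matrix as follows:

$$
\mathbf{L} = [\mathbf{L}_1, \mathbf{L}_2], \quad \mathbf{L}_1 \in \mathbb{R}^{(D-2)\times(D-N)}, \quad \mathbf{L}_2 \in \mathbb{R}^{(D-2)\times N} \qquad (4.79)
$$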


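We can also partition the precision matrix of the joint distribution:

$$
\Lambda = \mathbf{L}^T\mathbf{L} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}
= \begin{pmatrix} \mathbf{L}_1^T\mathbf{L}_1 & \mathbf{L}_1^T\mathbf{L}_2 \\ \mathbf{L}_2^T\mathbf{L}_1 & \mathbf{L}_2^T\mathbf{L}_2 \end{pmatrix} \qquad (4.80)
$$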


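Using Equation 4.69, we can write the conditional distribution as follows:

$$
\begin{aligned}
p(\mathbf{x}_1|\mathbf{x}_2) &= \mathcal{N}(\mu_{1|2}, \Sigma_{1|2}) && (4.81) \\
\mu_{1|2} &= -\Lambda_{11}^{-1}\Lambda_{12}\mathbf{x}_2 = -(\mathbf{L}_1^T\mathbf{L}_1)^{-1}\mathbf{L}_1^T\mathbf{L}_2\mathbf{x}_2 && (4.82) \\
\Sigma_{1|2} &= \Lambda_{11}^{-1} && (4.83)
\end{aligned}
$$

Note that we can compute the mean by solving the following system of linear equations (in the
least squares sense when $\mathbf{L}_1$ is not square, i.e., when N > 2):

$$
\mathbf{L}_1\mu_{1|2} = -\mathbf{L}_2\mathbf{x}_2 \qquad (4.84)
$$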


This is efficient since $\mathbf{L}_1$ is tridiagonal. Figure 4.10 gives an illustration of these equations. We
see that the posterior mean $\mu_{1|2}$ equals the observed data at the specified points, and smoothly
interpolates in between, as desired.
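
A minimal sketch of Equations 4.77 to 4.83 follows. It is not gaussInterpDemo: the function name `interpolate_noise_free`, its arguments, and the use of 0-based grid indices are assumptions for this example. The mean is obtained from the normal equations $\Lambda_{11}\mu_{1|2} = -\Lambda_{12}\mathbf{x}_2$, which gives the least squares solution of Equation 4.84.

```python
import numpy as np

def interpolate_noise_free(D, obs_idx, obs_val, lam=1.0):
    """Posterior mean and covariance of the unobserved grid values (Eqs. 4.81-4.83).

    obs_idx holds 0-based grid indices of the noise-free observations obs_val.
    Returns the full interpolated vector, the hidden indices and Sigma_{1|2}.
    """
    obs_idx = np.asarray(obs_idx)
    obs_val = np.asarray(obs_val, dtype=float)
    # Second order finite difference matrix of Eq. 4.77, scaled by lam.
    L = np.zeros((D - 2, D))
    for r in range(D - 2):
        L[r, r:r + 3] = [-0.5, 1.0, -0.5]
    L *= lam
    hid_idx = np.setdiff1d(np.arange(D), obs_idx)
    L1, L2 = L[:, hid_idx], L[:, obs_idx]
    Lam11 = L1.T @ L1
    mu = -np.linalg.solve(Lam11, L1.T @ (L2 @ obs_val))   # Eq. 4.82
    Sigma = np.linalg.inv(Lam11)                          # Eq. 4.83
    x = np.empty(D)
    x[obs_idx] = obs_val
    x[hid_idx] = mu
    return x, hid_idx, Sigma
```

Calling the function with two different values of `lam` returns the same mean but different covariances, since the scale factor cancels in $\Lambda_{11}^{-1}\Lambda_{12}$.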

It is also interesting to plot the 95% pointwise marginal credibility intervals, $\mu_j \pm
2\sqrt{\Sigma_{1|2,jj}}$, shown in grey. We see that the variance goes up as we move away from the
data. We also see that the variance goes up as we decrease the precision of the prior, $\lambda$.
Interestingly, $\lambda$ has no effect on the posterior mean, since it cancels out when multiplying $\Lambda_{11}^{-1}$
and $\Lambda_{12}$. By contrast, when we consider noisy data in Section 4.4.2.3, we will see that the prior
precision affects the smoothness of the posterior mean estimate.

The marginal credibility intervals do not capture the fact that neighboring locations are 
correlated. We can represent that by drawing complete functions (i.e., vectors x) from the 
posterior, and plotting them. These are shown by the thin lines in Figure 4.10. These are not 
quite as smooth as the posterior mean itself. This is because the prior only penalizes first-order 
differences. See Section 4.4.2.3 for further discussion of this point. 


Data imputation 


Suppose we are missing some entries in a design matrix. If the columns are correlated, we can 
use the observed entries to predict the missing entries. Figure 4.11 shows a simple example. We 
sampled some data from a 20 dimensional Gaussian, and then deliberately “hid” 50% of the data 
in each row. We then inferred the missing entries given the observed entries, using the true 
(generating) model. More precisely, for each row i, we compute $p(\mathbf{x}_{h_i}|\mathbf{x}_{v_i}, \theta)$, where $h_i$ and $v_i$
are the indices of the hidden and visible entries in case i. From this, we compute the marginal
distribution of each missing variable, $p(x_{h_{ij}}|\mathbf{x}_{v_i}, \theta)$. We then plot the mean of this distribution,
$\hat{x}_{ij} = E[x_j|\mathbf{x}_{v_i}, \theta]$; this represents our “best guess” about the true value of that entry, in the
sense that it minimizes our expected squared error (see Section 5.7 for details). Figure 4.11 shows
that the estimates are quite close to the truth. (Of course, if $j \in v_i$, the expected value is equal
to the observed value, $\hat{x}_{ij} = x_{ij}$.)

Figure 4.11 Illustration of data imputation. Left column: visualization of three rows of the data matrix
with missing entries. Middle column: mean of the posterior predictive, based on partially observed
data in that row, but the true model parameters. Right column: true values. Figure generated by
gaussImputationDemo.

We can use $\mathrm{var}[x_{h_{ij}}|\mathbf{x}_{v_i}, \theta]$ as a measure of confidence in this guess, although this is not
shown. Alternatively, we could draw multiple samples from $p(\mathbf{x}_{h_i}|\mathbf{x}_{v_i}, \theta)$; this is called multiple
imputation.

In addition to imputing the missing entries, we may be interested in computing the
likelihood of each partially observed row in the table, $p(\mathbf{x}_{v_i}|\theta)$, which can be computed using
Equation 4.68. This is useful for detecting outliers (atypical observations).
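
Imputation of a single row can be sketched as follows, assuming the model parameters are known. This is an illustration only, not gaussImputationDemo; the function name `impute_row` and the convention of marking missing entries with NaN are assumptions for this example.

```python
import numpy as np

def impute_row(x, mu, Sigma):
    """Fill the NaN entries of x with E[x_h | x_v] and report var[x_h | x_v].

    Observed entries are returned unchanged with zero variance.
    """
    h = np.isnan(x)          # hidden entries
    v = ~h                   # visible entries
    x_hat = x.copy()
    var = np.zeros_like(x)
    if not h.any():
        return x_hat, var
    if not v.any():          # nothing observed: fall back to the prior marginals
        return mu.copy(), np.diag(Sigma).copy()
    S_hv = Sigma[np.ix_(h, v)]
    S_vv = Sigma[np.ix_(v, v)]
    K = np.linalg.solve(S_vv, S_hv.T).T              # Sigma_hv Sigma_vv^{-1}
    x_hat[h] = mu[h] + K @ (x[v] - mu[v])            # conditional mean (Eq. 4.69)
    var[h] = np.diag(Sigma[np.ix_(h, h)] - K @ S_hv.T)
    return x_hat, var
```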


Information form 


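Suppose $\mathbf{x} \sim \mathcal{N}(\mu, \Sigma)$. One can show that $E[\mathbf{x}] = \mu$ is the mean vector, and $\mathrm{cov}[\mathbf{x}] = \Sigma$ is
the covariance matrix. These are called the moment parameters of the distribution. However,
it is sometimes useful to use the canonical parameters or natural parameters, defined as

$$
\Lambda \triangleq \Sigma^{-1}, \quad \xi \triangleq \Sigma^{-1}\mu \qquad (4.85)
$$

We can convert back to the moment parameters using

$$
\mu = \Lambda^{-1}\xi, \quad \Sigma = \Lambda^{-1} \qquad (4.86)
$$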


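Using the canonical parameters, we can write the MVN in information form (i.e., in exponential
family form, defined in Section 9.2):

$$
\mathcal{N}_c(\mathbf{x}|\xi, \Lambda) = (2\pi)^{-D/2}|\Lambda|^{\frac{1}{2}}\exp\left[-\frac{1}{2}\left(\mathbf{x}^T\Lambda\mathbf{x} + \xi^T\Lambda^{-1}\xi - 2\mathbf{x}^T\xi\right)\right] \qquad (4.87)
$$

where we use the notation $\mathcal{N}_c()$ to distinguish from the moment parameterization $\mathcal{N}()$.
It is also possible to derive the marginalization and conditioning formulas in information
form. We find

$$
\begin{aligned}
p(\mathbf{x}_2) &= \mathcal{N}_c(\mathbf{x}_2|\xi_2 - \Lambda_{21}\Lambda_{11}^{-1}\xi_1,\; \Lambda_{22} - \Lambda_{21}\Lambda_{11}^{-1}\Lambda_{12}) && (4.88) \\
p(\mathbf{x}_1|\mathbf{x}_2) &= \mathcal{N}_c(\mathbf{x}_1|\xi_1 - \Lambda_{12}\mathbf{x}_2,\; \Lambda_{11}) && (4.89)
\end{aligned}
$$

Thus we see that marginalization is easier in moment form, and conditioning is easier in
information form.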

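Another operation that is significantly easier in information form is multiplying two Gaussians.
One can show that

$$
\mathcal{N}_c(\xi_f, \lambda_f)\,\mathcal{N}_c(\xi_g, \lambda_g) = \mathcal{N}_c(\xi_f + \xi_g,\; \lambda_f + \lambda_g) \qquad (4.90)
$$

However, in moment form, things are much messier:

$$
\mathcal{N}(\mu_f, \sigma_f^2)\,\mathcal{N}(\mu_g, \sigma_g^2) = \mathcal{N}\left(\frac{\mu_f\sigma_g^2 + \mu_g\sigma_f^2}{\sigma_f^2 + \sigma_g^2},\; \frac{\sigma_f^2\sigma_g^2}{\sigma_f^2 + \sigma_g^2}\right) \qquad (4.91)
$$
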
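
The two product rules in Equations 4.90 and 4.91 can be checked against each other with a few lines of plain Python. The function names below are assumptions for this example; both functions ignore the normalization constant of the product.

```python
def product_info(xi_f, lam_f, xi_g, lam_g):
    """Product of two 1d Gaussians in information form (Eq. 4.90): parameters add."""
    return xi_f + xi_g, lam_f + lam_g

def product_moment(mu_f, var_f, mu_g, var_g):
    """Product of two 1d Gaussians in moment form (Eq. 4.91)."""
    mu = (mu_f * var_g + mu_g * var_f) / (var_f + var_g)
    var = var_f * var_g / (var_f + var_g)
    return mu, var

def info_to_moment(xi, lam):
    """Convert canonical parameters back to mean and variance (Eq. 4.86)."""
    return xi / lam, 1.0 / lam
```

For instance, multiplying $\mathcal{N}(0, 1)$ by $\mathcal{N}(3, 2)$ gives mean 1 and variance 2/3 by either route.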
Proof of the result * 


We now prove Theorem 4.3.1. Readers who are intimidated by heavy matrix algebra can safely 
skip this section. We first derive some results that we will need here and elsewhere in the book. 
We will return to the proof at the end. 


Inverse of a partitioned matrix using Schur complements 


The key tool we need is a way to invert a partitioned matrix. This can be done using the 
following result. 


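Theorem 4.3.2 (Inverse of a partitioned matrix). Consider a general partitioned matrix

$$
\mathbf{M} = \begin{pmatrix} \mathbf{E} & \mathbf{F} \\ \mathbf{G} & \mathbf{H} \end{pmatrix} \qquad (4.92)
$$

where we assume $\mathbf{E}$ and $\mathbf{H}$ are invertible. We have

$$
\begin{aligned}
\mathbf{M}^{-1} &= \begin{pmatrix} (\mathbf{M}/\mathbf{H})^{-1} & -(\mathbf{M}/\mathbf{H})^{-1}\mathbf{F}\mathbf{H}^{-1} \\ -\mathbf{H}^{-1}\mathbf{G}(\mathbf{M}/\mathbf{H})^{-1} & \mathbf{H}^{-1} + \mathbf{H}^{-1}\mathbf{G}(\mathbf{M}/\mathbf{H})^{-1}\mathbf{F}\mathbf{H}^{-1} \end{pmatrix} && (4.93) \\
&= \begin{pmatrix} \mathbf{E}^{-1} + \mathbf{E}^{-1}\mathbf{F}(\mathbf{M}/\mathbf{E})^{-1}\mathbf{G}\mathbf{E}^{-1} & -\mathbf{E}^{-1}\mathbf{F}(\mathbf{M}/\mathbf{E})^{-1} \\ -(\mathbf{M}/\mathbf{E})^{-1}\mathbf{G}\mathbf{E}^{-1} & (\mathbf{M}/\mathbf{E})^{-1} \end{pmatrix} && (4.94)
\end{aligned}
$$

where

$$
\begin{aligned}
\mathbf{M}/\mathbf{H} &\triangleq \mathbf{E} - \mathbf{F}\mathbf{H}^{-1}\mathbf{G} && (4.95) \\
\mathbf{M}/\mathbf{E} &\triangleq \mathbf{H} - \mathbf{G}\mathbf{E}^{-1}\mathbf{F} && (4.96)
\end{aligned}
$$

We say that $\mathbf{M}/\mathbf{H}$ is the Schur complement of $\mathbf{M}$ wrt $\mathbf{H}$. Equation 4.93 is called the partitioned
inverse formula.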
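
Equation 4.93 is easy to check numerically. The sketch below assembles the inverse from the four blocks using the Schur complement $\mathbf{M}/\mathbf{H}$; the function name `partitioned_inverse` is an assumption for this example, and explicit inverses are used for readability rather than numerical efficiency.

```python
import numpy as np

def partitioned_inverse(E, F, G, H):
    """Inverse of M = [[E, F], [G, H]] via the Schur complement M/H (Eq. 4.93)."""
    H_inv = np.linalg.inv(H)
    S_inv = np.linalg.inv(E - F @ H_inv @ G)          # (M/H)^{-1}
    top = np.hstack([S_inv, -S_inv @ F @ H_inv])
    bottom = np.hstack([-H_inv @ G @ S_inv,
                        H_inv + H_inv @ G @ S_inv @ F @ H_inv])
    return np.vstack([top, bottom])
```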


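Proof. If we could block diagonalize $\mathbf{M}$, it would be easier to invert. To zero out the top right
block of $\mathbf{M}$ we can pre-multiply as follows

$$
\begin{pmatrix} \mathbf{I} & -\mathbf{F}\mathbf{H}^{-1} \\ \mathbf{0} & \mathbf{I} \end{pmatrix}
\begin{pmatrix} \mathbf{E} & \mathbf{F} \\ \mathbf{G} & \mathbf{H} \end{pmatrix}
= \begin{pmatrix} \mathbf{E} - \mathbf{F}\mathbf{H}^{-1}\mathbf{G} & \mathbf{0} \\ \mathbf{G} & \mathbf{H} \end{pmatrix} \qquad (4.97)
$$

Similarly, to zero out the bottom left we can post-multiply as follows

$$
\begin{pmatrix} \mathbf{E} - \mathbf{F}\mathbf{H}^{-1}\mathbf{G} & \mathbf{0} \\ \mathbf{G} & \mathbf{H} \end{pmatrix}
\begin{pmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{H}^{-1}\mathbf{G} & \mathbf{I} \end{pmatrix}
= \begin{pmatrix} \mathbf{E} - \mathbf{F}\mathbf{H}^{-1}\mathbf{G} & \mathbf{0} \\ \mathbf{0} & \mathbf{H} \end{pmatrix} \qquad (4.98)
$$

Putting it all together we get

$$
\underbrace{\begin{pmatrix} \mathbf{I} & -\mathbf{F}\mathbf{H}^{-1} \\ \mathbf{0} & \mathbf{I} \end{pmatrix}}_{\mathbf{X}}
\underbrace{\begin{pmatrix} \mathbf{E} & \mathbf{F} \\ \mathbf{G} & \mathbf{H} \end{pmatrix}}_{\mathbf{M}}
\underbrace{\begin{pmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{H}^{-1}\mathbf{G} & \mathbf{I} \end{pmatrix}}_{\mathbf{Z}}
= \underbrace{\begin{pmatrix} \mathbf{E} - \mathbf{F}\mathbf{H}^{-1}\mathbf{G} & \mathbf{0} \\ \mathbf{0} & \mathbf{H} \end{pmatrix}}_{\mathbf{W}} \qquad (4.99)
$$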


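Taking the inverse of both sides yields

$$
\mathbf{Z}^{-1}\mathbf{M}^{-1}\mathbf{X}^{-1} = \mathbf{W}^{-1} \qquad (4.100)
$$

and hence

$$
\mathbf{M}^{-1} = \mathbf{Z}\mathbf{W}^{-1}\mathbf{X} \qquad (4.101)
$$

Substituting in the definitions we get

$$
\begin{aligned}
\begin{pmatrix} \mathbf{E} & \mathbf{F} \\ \mathbf{G} & \mathbf{H} \end{pmatrix}^{-1}
&= \begin{pmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{H}^{-1}\mathbf{G} & \mathbf{I} \end{pmatrix}
\begin{pmatrix} (\mathbf{M}/\mathbf{H})^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{H}^{-1} \end{pmatrix}
\begin{pmatrix} \mathbf{I} & -\mathbf{F}\mathbf{H}^{-1} \\ \mathbf{0} & \mathbf{I} \end{pmatrix} && (4.102) \\
&= \begin{pmatrix} (\mathbf{M}/\mathbf{H})^{-1} & \mathbf{0} \\ -\mathbf{H}^{-1}\mathbf{G}(\mathbf{M}/\mathbf{H})^{-1} & \mathbf{H}^{-1} \end{pmatrix}
\begin{pmatrix} \mathbf{I} & -\mathbf{F}\mathbf{H}^{-1} \\ \mathbf{0} & \mathbf{I} \end{pmatrix} && (4.103) \\
&= \begin{pmatrix} (\mathbf{M}/\mathbf{H})^{-1} & -(\mathbf{M}/\mathbf{H})^{-1}\mathbf{F}\mathbf{H}^{-1} \\ -\mathbf{H}^{-1}\mathbf{G}(\mathbf{M}/\mathbf{H})^{-1} & \mathbf{H}^{-1} + \mathbf{H}^{-1}\mathbf{G}(\mathbf{M}/\mathbf{H})^{-1}\mathbf{F}\mathbf{H}^{-1} \end{pmatrix} && (4.104)
\end{aligned}
$$

Alternatively, we could have decomposed the matrix $\mathbf{M}$ in terms of $\mathbf{E}$ and $\mathbf{M}/\mathbf{E} = (\mathbf{H} -
\mathbf{G}\mathbf{E}^{-1}\mathbf{F})$, yielding

$$
\begin{pmatrix} \mathbf{E} & \mathbf{F} \\ \mathbf{G} & \mathbf{H} \end{pmatrix}^{-1}
= \begin{pmatrix} \mathbf{E}^{-1} + \mathbf{E}^{-1}\mathbf{F}(\mathbf{M}/\mathbf{E})^{-1}\mathbf{G}\mathbf{E}^{-1} & -\mathbf{E}^{-1}\mathbf{F}(\mathbf{M}/\mathbf{E})^{-1} \\ -(\mathbf{M}/\mathbf{E})^{-1}\mathbf{G}\mathbf{E}^{-1} & (\mathbf{M}/\mathbf{E})^{-1} \end{pmatrix} \qquad (4.105)
$$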


The matrix inversion lemma 


We now derive some useful corollaries of the above result. 


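Corollary 4.3.1 (Matrix inversion lemma). Consider a general partitioned matrix
$\mathbf{M} = \begin{pmatrix} \mathbf{E} & \mathbf{F} \\ \mathbf{G} & \mathbf{H} \end{pmatrix}$,
where we assume $\mathbf{E}$ and $\mathbf{H}$ are invertible. We have

$$
\begin{aligned}
(\mathbf{E} - \mathbf{F}\mathbf{H}^{-1}\mathbf{G})^{-1} &= \mathbf{E}^{-1} + \mathbf{E}^{-1}\mathbf{F}(\mathbf{H} - \mathbf{G}\mathbf{E}^{-1}\mathbf{F})^{-1}\mathbf{G}\mathbf{E}^{-1} && (4.106) \\
(\mathbf{E} - \mathbf{F}\mathbf{H}^{-1}\mathbf{G})^{-1}\mathbf{F}\mathbf{H}^{-1} &= \mathbf{E}^{-1}\mathbf{F}(\mathbf{H} - \mathbf{G}\mathbf{E}^{-1}\mathbf{F})^{-1} && (4.107) \\
|\mathbf{E} - \mathbf{F}\mathbf{H}^{-1}\mathbf{G}| &= |\mathbf{H} - \mathbf{G}\mathbf{E}^{-1}\mathbf{F}|\,|\mathbf{H}^{-1}|\,|\mathbf{E}| && (4.108)
\end{aligned}
$$

The first two equations are known as the matrix inversion lemma or the Sherman-
Morrison-Woodbury formula. The third equation is known as the matrix determinant
lemma. A typical application in machine learning/statistics is the following. Let $\mathbf{E} = \Sigma$
be a N x N diagonal matrix, let $\mathbf{F} = \mathbf{G}^T = \mathbf{X}$ of size N x D, where N ≫ D, and let
$\mathbf{H}^{-1} = -\mathbf{I}$. Then we have

$$
(\Sigma + \mathbf{X}\mathbf{X}^T)^{-1} = \Sigma^{-1} - \Sigma^{-1}\mathbf{X}(\mathbf{I} + \mathbf{X}^T\Sigma^{-1}\mathbf{X})^{-1}\mathbf{X}^T\Sigma^{-1} \qquad (4.109)
$$

The LHS takes $O(N^3)$ time to compute, the RHS takes time $O(D^3)$ to compute.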
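Another application concerns computing a rank one update of an inverse matrix. Let
$\mathbf{H} = -1$ (a scalar), $\mathbf{F} = \mathbf{u}$ (a column vector), and $\mathbf{G} = \mathbf{v}^T$ (a row vector). Then we have

$$
\begin{aligned}
(\mathbf{E} + \mathbf{u}\mathbf{v}^T)^{-1} &= \mathbf{E}^{-1} + \mathbf{E}^{-1}\mathbf{u}(-1 - \mathbf{v}^T\mathbf{E}^{-1}\mathbf{u})^{-1}\mathbf{v}^T\mathbf{E}^{-1} && (4.110) \\
&= \mathbf{E}^{-1} - \frac{\mathbf{E}^{-1}\mathbf{u}\mathbf{v}^T\mathbf{E}^{-1}}{1 + \mathbf{v}^T\mathbf{E}^{-1}\mathbf{u}} && (4.111)
\end{aligned}
$$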


This is useful when we incrementally add a data vector to a design matrix, and want to update 
our sufficient statistics. (One can derive an analogous formula for removing a data vector.) 
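
Both applications can be sketched in NumPy as follows. The function names `rank_one_update` and `woodbury_inverse` are assumptions for this example; the second function takes the diagonal of $\Sigma$ as a vector.

```python
import numpy as np

def rank_one_update(E_inv, u, v):
    """Inverse of (E + u v^T) given E^{-1}, using Eq. 4.111."""
    Eu = E_inv @ u                     # E^{-1} u
    vE = v @ E_inv                     # v^T E^{-1}
    return E_inv - np.outer(Eu, vE) / (1.0 + v @ Eu)

def woodbury_inverse(sigma_diag, X):
    """Inverse of (Sigma + X X^T) for diagonal Sigma, using Eq. 4.109.

    Only a D x D linear system is solved, where D = X.shape[1].
    """
    Sinv_X = X / sigma_diag[:, None]                   # Sigma^{-1} X
    small = np.eye(X.shape[1]) + X.T @ Sinv_X          # I + X^T Sigma^{-1} X
    return np.diag(1.0 / sigma_diag) - Sinv_X @ np.linalg.solve(small, Sinv_X.T)
```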


Proof. To prove Equation 4.106, we simply equate the top left blocks of Equation 4.93 and
Equation 4.94. To prove Equation 4.107, we simply equate the top right blocks of Equations 4.93 and
4.94. The proof of Equation 4.108 is left as an exercise.


Proof of Gaussian conditioning formulas 


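We can now return to our original goal, which is to derive Equation 4.69. Let us factor the joint
$p(\mathbf{x}_1, \mathbf{x}_2)$ as $p(\mathbf{x}_2)p(\mathbf{x}_1|\mathbf{x}_2)$ as follows:

$$
E = \exp\left\{-\frac{1}{2}
\begin{pmatrix} \mathbf{x}_1 - \mu_1 \\ \mathbf{x}_2 - \mu_2 \end{pmatrix}^T
\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1}
\begin{pmatrix} \mathbf{x}_1 - \mu_1 \\ \mathbf{x}_2 - \mu_2 \end{pmatrix}\right\} \qquad (4.112)
$$

Using Equation 4.102 the above exponent becomes

$$
\begin{aligned}
E &= \exp\Bigg\{-\frac{1}{2}
\begin{pmatrix} \mathbf{x}_1 - \mu_1 \\ \mathbf{x}_2 - \mu_2 \end{pmatrix}^T
\begin{pmatrix} \mathbf{I} & \mathbf{0} \\ -\Sigma_{22}^{-1}\Sigma_{21} & \mathbf{I} \end{pmatrix}
\begin{pmatrix} (\Sigma/\Sigma_{22})^{-1} & \mathbf{0} \\ \mathbf{0} & \Sigma_{22}^{-1} \end{pmatrix} && (4.113) \\
&\qquad\quad \times
\begin{pmatrix} \mathbf{I} & -\Sigma_{12}\Sigma_{22}^{-1} \\ \mathbf{0} & \mathbf{I} \end{pmatrix}
\begin{pmatrix} \mathbf{x}_1 - \mu_1 \\ \mathbf{x}_2 - \mu_2 \end{pmatrix}\Bigg\} && (4.114) \\
&= \exp\Big\{-\frac{1}{2}\left(\mathbf{x}_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \mu_2)\right)^T(\Sigma/\Sigma_{22})^{-1} && (4.115) \\
&\qquad\quad \left(\mathbf{x}_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \mu_2)\right)\Big\}
\times \exp\left\{-\frac{1}{2}(\mathbf{x}_2 - \mu_2)^T\Sigma_{22}^{-1}(\mathbf{x}_2 - \mu_2)\right\} && (4.116)
\end{aligned}
$$

This is of the form

$$
\exp(\text{quadratic form in } \mathbf{x}_1, \mathbf{x}_2) \times \exp(\text{quadratic form in } \mathbf{x}_2) \qquad (4.117)
$$

Hence we have successfully factorized the joint as

$$
\begin{aligned}
p(\mathbf{x}_1, \mathbf{x}_2) &= p(\mathbf{x}_1|\mathbf{x}_2)p(\mathbf{x}_2) && (4.118) \\
&= \mathcal{N}(\mathbf{x}_1|\mu_{1|2}, \Sigma_{1|2})\,\mathcal{N}(\mathbf{x}_2|\mu_2, \Sigma_{22}) && (4.119)
\end{aligned}
$$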


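where the parameters of the conditional distribution can be read off from the above equations
using

$$
\begin{aligned}
\mu_{1|2} &= \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \mu_2) && (4.120) \\
\Sigma_{1|2} &= \Sigma/\Sigma_{22} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} && (4.121)
\end{aligned}
$$

We can also use the fact that $|\mathbf{M}| = |\mathbf{M}/\mathbf{H}||\mathbf{H}|$ to check the normalization constants are
correct:

$$
\begin{aligned}
(2\pi)^{(d_1+d_2)/2}|\Sigma|^{\frac{1}{2}} &= (2\pi)^{(d_1+d_2)/2}\left(|\Sigma/\Sigma_{22}|\,|\Sigma_{22}|\right)^{\frac{1}{2}} && (4.122) \\
&= (2\pi)^{d_1/2}|\Sigma/\Sigma_{22}|^{\frac{1}{2}}\,(2\pi)^{d_2/2}|\Sigma_{22}|^{\frac{1}{2}} && (4.123)
\end{aligned}
$$

where $d_1 = \dim(\mathbf{x}_1)$ and $d_2 = \dim(\mathbf{x}_2)$.
We leave the proof of the other forms of the result in Equation 4.69 as an exercise.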


Linear Gaussian systems 


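Suppose we have two variables, $\mathbf{x}$ and $\mathbf{y}$. Let $\mathbf{x} \in \mathbb{R}^{D_x}$ be a hidden variable, and $\mathbf{y} \in \mathbb{R}^{D_y}$ be
a noisy observation of $\mathbf{x}$. Let us assume we have the following prior and likelihood:

$$
\begin{aligned}
p(\mathbf{x}) &= \mathcal{N}(\mathbf{x}|\mu_x, \Sigma_x) \\
p(\mathbf{y}|\mathbf{x}) &= \mathcal{N}(\mathbf{y}|\mathbf{A}\mathbf{x} + \mathbf{b}, \Sigma_y)
\end{aligned} \qquad (4.124)
$$

where $\mathbf{A}$ is a matrix of size $D_y \times D_x$. This is an example of a linear Gaussian system. We
can represent this schematically as $\mathbf{x} \rightarrow \mathbf{y}$, meaning $\mathbf{x}$ generates $\mathbf{y}$. In this section, we show
how to “invert the arrow”, that is, how to infer $\mathbf{x}$ from $\mathbf{y}$. We state the result below, then give
several examples, and finally we derive the result. We will see many more applications of these
results in later chapters.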


Statement of the result 


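Theorem 4.4.1 (Bayes rule for linear Gaussian systems). Given a linear Gaussian system, as in
Equation 4.124, the posterior $p(\mathbf{x}|\mathbf{y})$ is given by the following:

$$
\begin{aligned}
p(\mathbf{x}|\mathbf{y}) &= \mathcal{N}(\mathbf{x}|\mu_{x|y}, \Sigma_{x|y}) \\
\Sigma_{x|y}^{-1} &= \Sigma_x^{-1} + \mathbf{A}^T\Sigma_y^{-1}\mathbf{A} \\
\mu_{x|y} &= \Sigma_{x|y}\left[\mathbf{A}^T\Sigma_y^{-1}(\mathbf{y} - \mathbf{b}) + \Sigma_x^{-1}\mu_x\right]
\end{aligned} \qquad (4.125)
$$

In addition, the normalization constant $p(\mathbf{y})$ is given by

$$
p(\mathbf{y}) = \mathcal{N}(\mathbf{y}|\mathbf{A}\mu_x + \mathbf{b},\; \Sigma_y + \mathbf{A}\Sigma_x\mathbf{A}^T) \qquad (4.126)
$$

For the proof, see Section 4.4.3.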
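
Equation 4.125 can be written as a short NumPy function. The name `linear_gaussian_posterior` and its argument order are assumptions for this example, and explicit inverses are used for clarity rather than numerical efficiency.

```python
import numpy as np

def linear_gaussian_posterior(mu_x, Sigma_x, A, b, Sigma_y, y):
    """Posterior p(x | y) for the linear Gaussian system of Eq. 4.124 (Eq. 4.125)."""
    Sx_inv = np.linalg.inv(Sigma_x)
    Sy_inv = np.linalg.inv(Sigma_y)
    post_prec = Sx_inv + A.T @ Sy_inv @ A                    # posterior precision
    Sigma_post = np.linalg.inv(post_prec)
    mu_post = Sigma_post @ (A.T @ Sy_inv @ (y - b) + Sx_inv @ mu_x)
    return mu_post, Sigma_post
```

With a scalar prior $\mathcal{N}(0, 1)$, $\mathbf{A} = 1$, $\mathbf{b} = 0$, unit noise variance and $y = 3$, the function returns posterior mean 1.5 and variance 0.5.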


Examples 


In this section, we give some example applications of the above result. 


Inferring an unknown scalar from noisy measurements 


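Suppose we make N noisy measurements $y_i$ of some underlying quantity x; let us assume the
measurement noise has fixed precision $\lambda_y = 1/\sigma^2$, so the likelihood is

$$
p(y_i|x) = \mathcal{N}(y_i|x, \lambda_y^{-1}) \qquad (4.127)
$$

Now let us use a Gaussian prior for the value of the unknown source:

$$
p(x) = \mathcal{N}(x|\mu_0, \lambda_0^{-1}) \qquad (4.128)
$$

We want to compute $p(x|y_1, \ldots, y_N, \sigma^2)$. We can convert this to a form that lets us apply
Bayes rule for Gaussians by defining $\mathbf{y} = (y_1, \ldots, y_N)$, $\mathbf{A} = \mathbf{1}_N$ (an N x 1 column vector of 1's),
and $\Sigma_y^{-1} = \mathrm{diag}(\lambda_y\mathbf{I})$. Then we get

$$
\begin{aligned}
p(x|\mathbf{y}) &= \mathcal{N}(x|\mu_N, \lambda_N^{-1}) && (4.129) \\
\lambda_N &= \lambda_0 + N\lambda_y && (4.130) \\
\mu_N &= \frac{N\lambda_y\bar{y} + \lambda_0\mu_0}{\lambda_N} = \frac{N\lambda_y}{N\lambda_y + \lambda_0}\bar{y} + \frac{\lambda_0}{N\lambda_y + \lambda_0}\mu_0 && (4.131)
\end{aligned}
$$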


These equations are quite intuitive: the posterior precision $\lambda_N$ is the prior precision $\lambda_0$ plus N
units of measurement precision $\lambda_y$. Also, the posterior mean $\mu_N$ is a convex combination of
the MLE $\bar{y}$ and the prior mean $\mu_0$. This makes it clear that the posterior mean is a compromise
between the MLE and the prior. If the prior is weak relative to the signal strength ($\lambda_0$ is
small relative to $\lambda_y$), we put more weight on the MLE. If the prior is strong relative to the
signal strength ($\lambda_0$ is large relative to $\lambda_y$), we put more weight on the prior. This is illustrated
in Figure 4.12, which is very similar to the analogous results for the beta-binomial model in
Figure 3.6.

Note that the posterior mean is written in terms of $N\lambda_y\bar{y}$, so having N measurements each
of precision $\lambda_y$ is like having one measurement with value $\bar{y}$ and precision $N\lambda_y$.
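
Equations 4.130 and 4.131 amount to a two-line computation, sketched below in plain Python. The function name `posterior_scalar` is an assumption for this example.

```python
def posterior_scalar(ys, lam_y, mu0, lam0):
    """Posterior mean and precision of x given measurements ys (Eqs. 4.130-4.131)."""
    N = len(ys)
    y_bar = sum(ys) / N
    lam_N = lam0 + N * lam_y                           # precisions add
    mu_N = (N * lam_y * y_bar + lam0 * mu0) / lam_N    # precision-weighted average
    return mu_N, lam_N
```

As the text notes, passing N measurements of precision `lam_y` gives the same result as passing the single value `y_bar` with precision `N * lam_y`.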

We can rewrite the results in terms of the posterior variance, rather than posterior precision,
as follows:

$$
\begin{aligned}
p(x|\mathcal{D}, \sigma^2) &= \mathcal{N}(x|\mu_N, \tau_N^2) && (4.132) \\
\tau_N^2 &= \frac{1}{\frac{N}{\sigma^2} + \frac{1}{\tau_0^2}} = \frac{\sigma^2\tau_0^2}{N\tau_0^2 + \sigma^2} && (4.133) \\
\mu_N &= \tau_N^2\left(\frac{\mu_0}{\tau_0^2} + \frac{N\bar{y}}{\sigma^2}\right) = \frac{\sigma^2}{N\tau_0^2 + \sigma^2}\mu_0 + \frac{N\tau_0^2}{N\tau_0^2 + \sigma^2}\bar{y} && (4.134)
\end{aligned}
$$

where $\tau_0^2 = 1/\lambda_0$ is the prior variance and $\tau_N^2 = 1/\lambda_N$ is the posterior variance.

Figure 4.12 Inference about x given a noisy observation y = 3. (a) Strong prior $\mathcal{N}(0, 1)$. The posterior
mean is “shrunk” towards the prior mean, which is 0. (b) Weak prior $\mathcal{N}(0, 5)$. The posterior mean is
similar to the MLE. Figure generated by gaussInferParamsMean1d.

We can also compute the posterior sequentially, by updating after each observation. If
N = 1, we can rewrite the posterior after seeing a single observation as follows (where we
define $\Sigma_y = \sigma^2$, $\Sigma_0 = \tau_0^2$ and $\Sigma_1 = \tau_1^2$ to be the variances of the likelihood, prior and
posterior):

$$
\begin{aligned}
p(x|y) &= \mathcal{N}(x|\mu_1, \Sigma_1) && (4.135) \\
\Sigma_1 &= \left(\frac{1}{\Sigma_0} + \frac{1}{\Sigma_y}\right)^{-1} = \frac{\Sigma_y\Sigma_0}{\Sigma_0 + \Sigma_y} && (4.136) \\
\mu_1 &= \Sigma_1\left(\frac{\mu_0}{\Sigma_0} + \frac{y}{\Sigma_y}\right) && (4.137)
\end{aligned}
$$

We can rewrite the posterior mean in 3 different ways:

$$
\begin{aligned}
\mu_1 &= \frac{\Sigma_y}{\Sigma_y + \Sigma_0}\mu_0 + \frac{\Sigma_0}{\Sigma_y + \Sigma_0}y && (4.138) \\
&= \mu_0 + (y - \mu_0)\frac{\Sigma_0}{\Sigma_y + \Sigma_0} && (4.139) \\
&= y - (y - \mu_0)\frac{\Sigma_y}{\Sigma_y + \Sigma_0} && (4.140)
\end{aligned}
$$

The first equation is a convex combination of the prior and the data. The second equation is the
prior mean adjusted towards the data. The third equation is the data adjusted towards the prior
mean; this is called shrinkage. These are all equivalent ways of expressing the tradeoff between
likelihood and prior. If $\Sigma_0$ is small relative to $\Sigma_y$, corresponding to a strong prior, the amount
of shrinkage is large (see Figure 4.12(a)), whereas if $\Sigma_0$ is large relative to $\Sigma_y$, corresponding to
a weak prior, the amount of shrinkage is small (see Figure 4.12(b)).
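
The single-observation update, and the equivalence of the three forms in Equations 4.138 to 4.140, can be sketched as follows. The function names `update_one` and `sequential_posterior` are assumptions for this example.

```python
def update_one(mu0, var0, y, var_y):
    """One-observation update; returns the three forms of the mean and the variance."""
    var1 = var_y * var0 / (var0 + var_y)                               # Eq. 4.136
    convex = var_y / (var_y + var0) * mu0 + var0 / (var_y + var0) * y  # Eq. 4.138
    prior_adjusted = mu0 + (y - mu0) * var0 / (var_y + var0)           # Eq. 4.139
    shrinkage = y - (y - mu0) * var_y / (var_y + var0)                 # Eq. 4.140
    return (convex, prior_adjusted, shrinkage), var1

def sequential_posterior(ys, var_y, mu0, var0):
    """Process the observations one at a time, reusing each posterior as the next prior."""
    mu, var = mu0, var0
    for y in ys:
        (mu, _, _), var = update_one(mu, var, y, var_y)
    return mu, var
```

Processing the observations one at a time gives the same mean and variance as the batch formulas in Equations 4.133 and 4.134.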

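Another way to quantify the amount of shrinkage is in terms of the signal-to-noise ratio,
which is defined as follows:

$$
\mathrm{SNR} \triangleq \frac{E[X^2]}{E[\epsilon^2]} = \frac{\Sigma_0 + \mu_0^2}{\Sigma_y} \qquad (4.141)
$$

where $x \sim \mathcal{N}(\mu_0, \Sigma_0)$ is the true signal, $y = x + \epsilon$ is the observed signal, and $\epsilon \sim \mathcal{N}(0, \Sigma_y)$
is the noise term.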


Inferring an unknown vector from noisy measurements 


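Now consider N vector-valued observations, $\mathbf{y}_i \sim \mathcal{N}(\mathbf{x}, \Sigma_y)$, and a Gaussian prior, $\mathbf{x} \sim
\mathcal{N}(\mu_0, \Sigma_0)$. Setting $\mathbf{A} = \mathbf{I}$, $\mathbf{b} = \mathbf{0}$, and using $\bar{\mathbf{y}}$ for the effective observation with precision
$N\Sigma_y^{-1}$, we have

$$
\begin{aligned}
p(\mathbf{x}|\mathbf{y}_1, \ldots, \mathbf{y}_N) &= \mathcal{N}(\mathbf{x}|\mu_N, \Sigma_N) && (4.142) \\
\Sigma_N^{-1} &= \Sigma_0^{-1} + N\Sigma_y^{-1} && (4.143) \\
\mu_N &= \Sigma_N\left(\Sigma_y^{-1}(N\bar{\mathbf{y}}) + \Sigma_0^{-1}\mu_0\right) && (4.144)
\end{aligned}
$$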

See Figure 4.13 for a 2d example. We can think of x as representing the true, but unknown, 
location of an object in 2d space, such as a missile or airplane, and the y; as being noisy 
observations, such as radar “blips”. As we receive more blips, we are better able to localize the 
source. In Section 18.3.1, we will see how to extend this example to track moving objects using 
the famous Kalman filter algorithm. 

Now suppose we have multiple measuring devices, and we want to combine them together;
this is known as sensor fusion. If we have multiple observations with different covariances
(corresponding to sensors with different reliabilities), the posterior will be an appropriate weighted
average of the data. Consider the example in Figure 4.14. We use an uninformative prior on $\mathbf{x}$,
namely $p(\mathbf{x}) = \mathcal{N}(\mu_0, \Sigma_0) = \mathcal{N}(\mathbf{0}, 10^{10}\mathbf{I}_2)$. We get 2 noisy observations, $\mathbf{y}_1 \sim \mathcal{N}(\mathbf{x}, \Sigma_{y,1})$
and $\mathbf{y}_2 \sim \mathcal{N}(\mathbf{x}, \Sigma_{y,2})$. We then compute $p(\mathbf{x}|\mathbf{y}_1, \mathbf{y}_2)$.

In Figure 4.14(a), we set $\Sigma_{y,1} = \Sigma_{y,2} = 0.01\mathbf{I}_2$, so both sensors are equally reliable. In this
case, the posterior mean is half way between the two observations, $\mathbf{y}_1$ and $\mathbf{y}_2$. In Figure 4.14(b),
we set $\Sigma_{y,1} = 0.05\mathbf{I}_2$ and $\Sigma_{y,2} = 0.01\mathbf{I}_2$, so sensor 2 is more reliable than sensor 1. In this
case, the posterior mean is closer to $\mathbf{y}_2$. In Figure 4.14(c), we set

$$
\Sigma_{y,1} = 0.01\begin{pmatrix} 10 & 1 \\ 1 & 1 \end{pmatrix}, \quad
\Sigma_{y,2} = 0.01\begin{pmatrix} 1 & 1 \\ 1 & 10 \end{pmatrix} \qquad (4.145)
$$

so sensor 1 is more reliable in the $y_2$ component (vertical direction), and sensor 2 is more
reliable in the $y_1$ component (horizontal direction). In this case, the posterior mean uses $\mathbf{y}_1$'s
vertical component and $\mathbf{y}_2$'s horizontal component.

Figure 4.13 Illustration of Bayesian inference for the mean of a 2d Gaussian. (a) The data is generated
from $\mathbf{y}_i \sim \mathcal{N}(\mathbf{x}, \Sigma_y)$, where $\mathbf{x} = [0.5, 0.5]^T$ and $\Sigma_y = 0.1[2, 1; 1, 1]$. We assume the sensor noise
covariance $\Sigma_y$ is known but $\mathbf{x}$ is unknown. The black cross represents $\mathbf{x}$. (b) The prior is $p(\mathbf{x}) =
\mathcal{N}(\mathbf{x}|\mathbf{0}, 0.1\mathbf{I}_2)$. (c) We show the posterior after 10 data points have been observed. Figure generated by
gaussInferParamsMean2d.

Figure 4.14 We observe $\mathbf{y}_1 = (0, -1)$ (red cross) and $\mathbf{y}_2 = (1, 0)$ (green cross) and infer $E(\mu|\mathbf{y}_1, \mathbf{y}_2, \theta)$
(black cross). (a) Equally reliable sensors, so the posterior mean estimate is in between the two circles.
(b) Sensor 2 is more reliable, so the estimate shifts more towards the green circle. (c) Sensor 1 is more
reliable in the vertical direction, Sensor 2 is more reliable in the horizontal direction. The estimate is an
appropriate combination of the two measurements. Figure generated by sensorFusion2d.


Note that this technique crucially relies on modeling our uncertainty of each sensor; computing
an unweighted average would give the wrong result. However, we have assumed the sensor
precisions are known. When they are not, we should model our uncertainty about $\Sigma_1$ and $\Sigma_2$
as well. See Section 4.6.4 for details.
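
Sensor fusion with known covariances can be sketched by accumulating precisions and precision-weighted observations. This is an illustration only, not the sensorFusion2d script; the function name `fuse` is an assumption for this example.

```python
import numpy as np

def fuse(observations, covariances, mu0, Sigma0):
    """Posterior over x given one observation per sensor, each with its own covariance."""
    prec = np.linalg.inv(Sigma0)          # start from the prior precision
    info = prec @ mu0                     # and the prior precision-weighted mean
    for y, S in zip(observations, covariances):
        S_inv = np.linalg.inv(S)
        prec = prec + S_inv               # precisions add
        info = info + S_inv @ y
    Sigma_post = np.linalg.inv(prec)
    return Sigma_post @ info, Sigma_post
```

With $\mathbf{y}_1 = (0, -1)$, $\mathbf{y}_2 = (1, 0)$ and the covariances of Equation 4.145, the posterior mean is approximately (0.85, -0.85): close to $\mathbf{y}_2$ horizontally and to $\mathbf{y}_1$ vertically.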


Interpolating noisy data 


We now revisit the example of Section 4.3.2.2. This time we no longer assume noise-free
observations. Instead, let us assume that we obtain N noisy observations $y_i$; without loss
of generality, assume these correspond to $x_1, \ldots, x_N$. We can model this setup as a linear
Gaussian system:

$$
\mathbf{y} = \mathbf{A}\mathbf{x} + \epsilon \qquad (4.146)
$$

where $\epsilon \sim \mathcal{N}(\mathbf{0}, \Sigma_y)$, $\Sigma_y = \sigma^2\mathbf{I}$, $\sigma^2$ is the observation noise, and $\mathbf{A}$ is a N x D projection
matrix that selects out the observed elements. For example, if N = 2 and D = 4 we have

$$
\mathbf{A} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \qquad (4.147)
$$

Using the same improper prior as before, $\Sigma_x = (\mathbf{L}^T\mathbf{L})^{-1}$, we can easily compute the posterior
mean and variance. In Figure 4.15, we plot the posterior mean, posterior variance, and some
posterior samples. Now we see that the prior precision affects the posterior mean as well as
the posterior variance. In particular, for a strong prior (large $\lambda$), the estimate is very smooth, and
the uncertainty is low, but for a weak prior (small $\lambda$), the estimate is wiggly, and the uncertainty
(away from the data) is high.

The posterior mean can also be computed by solving the following optimization problem:

$$
\min_{\mathbf{x}} \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - y_i)^2 + \frac{\lambda}{2}\sum_{j=1}^{D}\left[(x_j - x_{j-1})^2 + (x_j - x_{j+1})^2\right] \qquad (4.148)
$$

where we have defined $x_0 = x_1$ and $x_{D+1} = x_D$ for notational simplicity. We recognize this
as a discrete approximation to the following problem:

$$
\min_{f} \frac{1}{2\sigma^2}\int (f(t) - y(t))^2\,dt + \frac{\lambda}{2}\int [f'(t)]^2\,dt \qquad (4.149)
$$

where $f'(t)$ is the first derivative of f. The first term measures fit to the data, and the second
term penalizes functions that are “too wiggly”. This is an example of Tikhonov regularization,
which is a popular approach to functional data analysis. See Chapter 15 for more sophisticated
approaches, which enforce higher order smoothness (so the resulting samples look less “jagged”).
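
The posterior for the noisy case follows from Equation 4.125 with a zero prior mean and $\mathbf{b} = \mathbf{0}$. The sketch below is an illustration only, not gaussInterpNoisyDemo: the function name `interpolate_noisy` is an assumption, the prior precision is taken to be $\lambda^2\mathbf{L}^T\mathbf{L}$ with $\mathbf{L}$ the second order difference matrix of Equation 4.77, and observations may sit at arbitrary 0-based grid indices.

```python
import numpy as np

def interpolate_noisy(D, obs_idx, y, lam, sigma2=1.0):
    """Posterior mean and covariance of x given noisy observations y at obs_idx."""
    y = np.asarray(y, dtype=float)
    L = np.zeros((D - 2, D))
    for r in range(D - 2):
        L[r, r:r + 3] = [-0.5, 1.0, -0.5]
    prior_prec = lam ** 2 * (L.T @ L)            # improper prior: rank D - 2
    A = np.zeros((len(obs_idx), D))              # projection matrix as in Eq. 4.147
    A[np.arange(len(obs_idx)), obs_idx] = 1.0
    post_prec = prior_prec + A.T @ A / sigma2    # proper once at least 2 points are observed
    Sigma_post = np.linalg.inv(post_prec)
    mu_post = Sigma_post @ (A.T @ y) / sigma2
    return mu_post, Sigma_post
```

In contrast to the noise-free case, changing `lam` here changes the posterior mean: a larger value gives a smoother estimate.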


Proof of the result * 


We now derive Equation 4.125. The basic idea is to derive the joint distribution, p(x,y) = 
p(x)p(y|x), and then to use the results from Section 4.3.1 for computing p(x|y). 

In more detail, we proceed as follows. The log of the joint distribution is as follows (dropping 
irrelevant constants): 


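$$
\log p(\mathbf{x}, \mathbf{y}) = -\frac{1}{2}(\mathbf{x} - \mu_x)^T\Sigma_x^{-1}(\mathbf{x} - \mu_x) - \frac{1}{2}(\mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{b})^T\Sigma_y^{-1}(\mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{b}) \qquad (4.150)
$$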


This is clearly a joint Gaussian distribution, since it is the exponential of a quadratic form. 
Expanding out the quadratic terms involving x and y, and ignoring linear and constant terms, 
we have 


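$$
\begin{aligned}
Q &= -\frac{1}{2}\mathbf{x}^T\Sigma_x^{-1}\mathbf{x} - \frac{1}{2}\mathbf{y}^T\Sigma_y^{-1}\mathbf{y} - \frac{1}{2}(\mathbf{A}\mathbf{x})^T\Sigma_y^{-1}(\mathbf{A}\mathbf{x}) + \mathbf{y}^T\Sigma_y^{-1}\mathbf{A}\mathbf{x} && (4.151) \\
&= -\frac{1}{2}\begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix}^T
\begin{pmatrix} \Sigma_x^{-1} + \mathbf{A}^T\Sigma_y^{-1}\mathbf{A} & -\mathbf{A}^T\Sigma_y^{-1} \\ -\Sigma_y^{-1}\mathbf{A} & \Sigma_y^{-1} \end{pmatrix}
\begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} && (4.152) \\
&= -\frac{1}{2}\begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix}^T \Sigma^{-1} \begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} && (4.153)
\end{aligned}
$$

where the precision matrix of the joint is defined as

$$
\Sigma^{-1} = \begin{pmatrix} \Sigma_x^{-1} + \mathbf{A}^T\Sigma_y^{-1}\mathbf{A} & -\mathbf{A}^T\Sigma_y^{-1} \\ -\Sigma_y^{-1}\mathbf{A} & \Sigma_y^{-1} \end{pmatrix}
\triangleq \Lambda = \begin{pmatrix} \Lambda_{xx} & \Lambda_{xy} \\ \Lambda_{yx} & \Lambda_{yy} \end{pmatrix} \qquad (4.154)
$$

From Equation 4.69, and using the fact that $\mu_y = \mathbf{A}\mu_x + \mathbf{b}$, we have

$$
\begin{aligned}
p(\mathbf{x}|\mathbf{y}) &= \mathcal{N}(\mu_{x|y}, \Sigma_{x|y}) && (4.155) \\
\Sigma_{x|y} &= \Lambda_{xx}^{-1} = (\Sigma_x^{-1} + \mathbf{A}^T\Sigma_y^{-1}\mathbf{A})^{-1} && (4.156) \\
\mu_{x|y} &= \Sigma_{x|y}\left(\Lambda_{xx}\mu_x - \Lambda_{xy}(\mathbf{y} - \mu_y)\right) && (4.157) \\
&= \Sigma_{x|y}\left(\Sigma_x^{-1}\mu_x + \mathbf{A}^T\Sigma_y^{-1}(\mathbf{y} - \mathbf{b})\right) && (4.158)
\end{aligned}
$$

Figure 4.15 Interpolating noisy data (noise variance $\sigma^2 = 1$) using a Gaussian with prior precision $\lambda$. (a)
$\lambda = 30$. (b) $\lambda = 0.01$. See also Figure 4.10. Based on Figure 7.1 of (Calvetti and Somersalo 2007). Figure
generated by gaussInterpNoisyDemo. See also splineBasisDemo.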

Digression: The Wishart distribution * 


The Wishart distribution is the generalization of the Gamma distribution to positive definite
matrices. Press (Press 2005, p107) has said “The Wishart distribution ranks next to the
(multivariate) normal distribution in order of importance and usefulness in multivariate statistics”.
We will mostly use it to model our uncertainty in covariance matrices, $\Sigma$, or their inverses,
$\Lambda = \Sigma^{-1}$.

The pdf of the Wishart is defined as follows:

$$
\mathrm{Wi}(\Lambda|\mathbf{S}, \nu) = \frac{1}{Z_{\mathrm{Wi}}}|\Lambda|^{(\nu - D - 1)/2}\exp\left(-\frac{1}{2}\mathrm{tr}(\Lambda\mathbf{S}^{-1})\right) \qquad (4.159)
$$

Here $\nu$ is called the “degrees of freedom” and $\mathbf{S}$ is the “scale matrix”. (We shall get more
intuition for these parameters shortly.) The normalization constant for this distribution (which
requires integrating over all symmetric pd matrices) is the following formidable expression

$$
Z_{\mathrm{Wi}} = 2^{\nu D/2}\,\Gamma_D(\nu/2)\,|\mathbf{S}|^{\nu/2} \qquad (4.160)
$$


where I p(a) is the multivariate gamma function: 


D 
p(x) =r? PDA] [T (w+ 0- 4)/2) (4.161) 


i=l 


Hence Tı (a) =T (a) and 


w+l—i 


D 
To(vo/2)=][ [T 7 


i=1 


) (4.162) 


The normalization constant only exists (and hence the pdf is only well defined) if v > D — 1. 

There is a connection between the Wishart distribution and the Gaussian. In particular, 
let x; ~ M (0, £). Then the scatter matrix S = ear x,;x? has a Wishart distribution: 
S ~ Wi(X, 1). Hence E [S] = NX. More generally, one can show that the mean and mode of 
Wi(S, v) are given by 


mean = vS, mode = (v — D — 1)S (4.163) 


where the mode only exists if v > D + 1. 
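The link between the scatter matrix and the Wishart mean can be checked by simulation. The following Python sketch (standard library only) draws zero-mean 2d Gaussian vectors, forms scatter matrices from $N = 5$ vectors each, and averages them; the result should be close to $N\boldsymbol{\Sigma}$. The covariance matrix, sample sizes and helper names are assumptions chosen for illustration.

```python
import math
import random


def cholesky_2x2(sigma):
    """Lower-triangular factor L with L L^T = sigma, for a 2x2 pd matrix."""
    (a, b), (_, c) = sigma
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    return l11, l21, l22


def average_scatter(sigma, n, trials, seed=0):
    """Average S = sum_{i=1}^{n} x_i x_i^T over many trials, with x_i ~ N(0, sigma)."""
    l11, l21, l22 = cholesky_2x2(sigma)
    rng = random.Random(seed)
    s11 = s12 = s22 = 0.0
    # Summing over all trials * n draws and dividing by trials gives the mean scatter.
    for _ in range(trials * n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1, x2 = l11 * z1, l21 * z1 + l22 * z2
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
    return [[s11 / trials, s12 / trials], [s12 / trials, s22 / trials]]


sigma = [[2.0, 0.5], [0.5, 1.0]]
mean_S = average_scatter(sigma, n=5, trials=20000)  # should be close to 5 * sigma
```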
If $D = 1$, the Wishart reduces to the Gamma distribution:

$$
\mathrm{Wi}(\lambda|s^{-1}, \nu) = \mathrm{Ga}\left(\lambda \,\middle|\, \frac{\nu}{2}, \frac{s}{2}\right) \tag{4.164}
$$


Inverse Wishart distribution 


Recall that we showed (Exercise 2.10) that if $\lambda \sim \mathrm{Ga}(a, b)$, then $\frac{1}{\lambda} \sim \mathrm{IG}(a, b)$. Similarly,
if $\boldsymbol{\Sigma}^{-1} \sim \mathrm{Wi}(\mathbf{S}, \nu)$ then $\boldsymbol{\Sigma} \sim \mathrm{IW}(\mathbf{S}^{-1}, \nu + D + 1)$, where IW is the inverse Wishart, the
multidimensional generalization of the inverse Gamma. It is defined as follows, for $\nu > D - 1$
and $\mathbf{S} \succ 0$:

\begin{align}
\mathrm{IW}(\boldsymbol{\Sigma}|\mathbf{S}, \nu) &= \frac{1}{Z_{\mathrm{IW}}} |\boldsymbol{\Sigma}|^{-(\nu + D + 1)/2} \exp\left(-\frac{1}{2}\mathrm{tr}(\mathbf{S}^{-1}\boldsymbol{\Sigma}^{-1})\right) \tag{4.165} \\
Z_{\mathrm{IW}} &= |\mathbf{S}|^{-\nu/2}\, 2^{\nu D/2}\, \Gamma_D(\nu/2) \tag{4.166}
\end{align}

One can show that the distribution has these properties

$$
\text{mean} = \frac{\mathbf{S}^{-1}}{\nu - D - 1}, \quad \text{mode} = \frac{\mathbf{S}^{-1}}{\nu + D + 1} \tag{4.167}
$$

If $D = 1$, this reduces to the inverse Gamma:

$$
\mathrm{IW}(\sigma^2|S^{-1}, \nu) = \mathrm{IG}(\sigma^2|\nu/2, S/2) \tag{4.168}
$$

Figure 4.16 Visualization of the Wishart distribution. Left: Some samples from the Wishart distribution,
$\boldsymbol{\Sigma} \sim \mathrm{Wi}(\mathbf{S}, \nu)$, where $\mathbf{S} = [3.1653, -0.0262; -0.0262, 0.6477]$ and $\nu = 3$. Right: Plots of the marginals
(which are Gamma), and the approximate (sample-based) marginal on the correlation coefficient. If $\nu = 3$
there is a lot of uncertainty about the value of the correlation coefficient $\rho$ (see the almost uniform
distribution on $[-1, 1]$). The sampled matrices are highly variable, and some are nearly singular. As $\nu$
increases, the sampled matrices are more concentrated on the prior $\mathbf{S}$. Figure generated by wiPlotDemo.


Visualizing the Wishart distribution * 


Since the Wishart is a distribution over matrices, it is hard to plot as a density function. However, 
we can easily sample from it, and in the 2d case, we can use the eigenvectors of the resulting 
matrix to define an ellipse, as explained in Section 4.1.2. See Figure 4.16 for some examples. 

For higher dimensional matrices, we can plot marginals of the distribution. The diagonals of 
a Wishart distributed matrix have Gamma distributions, so are easy to plot. It is hard in general 
to work out the distribution of the off-diagonal elements, but we can sample matrices from 
the distribution, and then compute the distribution empirically. In particular, we can convert 
each sampled matrix to a correlation matrix, and thus compute a Monte Carlo approximation 
(Section 2.7) to the expected correlation coefficients: 


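$$
\mathbb{E}[R_{ij}] \approx \frac{1}{S} \sum_{s=1}^{S} R(\boldsymbol{\Sigma}^{(s)})_{ij} \tag{4.169}
$$

where $\boldsymbol{\Sigma}^{(s)} \sim \mathrm{Wi}(\boldsymbol{\Sigma}, \nu)$ and $R(\boldsymbol{\Sigma})$ converts matrix $\boldsymbol{\Sigma}$ into a correlation matrix:

$$
R_{ij} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\Sigma_{jj}}} \tag{4.170}
$$

We can then use kernel density estimation (Section 14.7.2) to produce a smooth approximation
to the univariate density $\mathbb{E}[R_{ij}]$ for plotting purposes. See Figure 4.16 for some examples.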
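The Monte Carlo estimate of the expected correlation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code behind wiPlotDemo: it handles only the 2d case, and it samples the Wishart as a sum of outer products of Gaussian draws, which is valid only for integer degrees of freedom. The scale matrix and $\nu = 3$ follow Figure 4.16; all function names are hypothetical.

```python
import math
import random


def sample_wishart_2x2(scale, dof, rng):
    """Draw W = sum_{i=1}^{dof} x_i x_i^T with x_i ~ N(0, scale); integer dof only."""
    (a, b), (_, c) = scale
    l11 = math.sqrt(a)            # Cholesky factor of the scale matrix
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    w11 = w12 = w22 = 0.0
    for _ in range(dof):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1, x2 = l11 * z1, l21 * z1 + l22 * z2
        w11 += x1 * x1
        w12 += x1 * x2
        w22 += x2 * x2
    return w11, w12, w22


def monte_carlo_correlation(scale, dof, num_samples, seed=0):
    """Average of the sampled correlation coefficients, for the off-diagonal entry."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        w11, w12, w22 = sample_wishart_2x2(scale, dof, rng)
        total += w12 / math.sqrt(w11 * w22)  # convert the sample to a correlation
    return total / num_samples


scale = [[3.1653, -0.0262], [-0.0262, 0.6477]]
rho_hat = monte_carlo_correlation(scale, dof=3, num_samples=5000)
```

Because the scale matrix is nearly diagonal and $\nu$ is small, the individual sampled correlations are spread widely over $[-1, 1]$ while their average stays close to zero.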


Inferring the parameters of an MVN 


So far, we have discussed inference in a Gaussian assuming the parameters $\boldsymbol{\theta} = (\boldsymbol{\mu}, \boldsymbol{\Sigma})$ are
known. We now discuss how to infer the parameters themselves. We will assume the data has
the form $\mathbf{x}_i \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ for $i = 1 : N$ and is fully observed, so we have no missing data (see
Section 11.6.1 for how to estimate parameters of an MVN in the presence of missing values). To
simplify the presentation, we derive the posterior in three parts: first we compute $p(\boldsymbol{\mu}|\mathcal{D}, \boldsymbol{\Sigma})$;
then we compute $p(\boldsymbol{\Sigma}|\mathcal{D}, \boldsymbol{\mu})$; finally we compute the joint $p(\boldsymbol{\mu}, \boldsymbol{\Sigma}|\mathcal{D})$.


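Posterior distribution of $\boldsymbol{\mu}$


We have discussed how to compute the MLE for $\boldsymbol{\mu}$; we now discuss how to compute its posterior,
which is useful for modeling our uncertainty about its value.

The likelihood has the form

$$
p(\mathcal{D}|\boldsymbol{\mu}) = \mathcal{N}\left(\bar{\mathbf{x}} \,\middle|\, \boldsymbol{\mu}, \frac{1}{N}\boldsymbol{\Sigma}\right) \tag{4.171}
$$

For simplicity, we will use a conjugate prior, which in this case is a Gaussian. In particular, if
$p(\boldsymbol{\mu}) = \mathcal{N}(\boldsymbol{\mu}|\mathbf{m}_0, \mathbf{V}_0)$ then we can derive a Gaussian posterior for $\boldsymbol{\mu}$ based on the results in
Section 4.4.2.2. We get

\begin{align}
p(\boldsymbol{\mu}|\mathcal{D}, \boldsymbol{\Sigma}) &= \mathcal{N}(\boldsymbol{\mu}|\mathbf{m}_N, \mathbf{V}_N) \tag{4.172} \\
\mathbf{V}_N^{-1} &= \mathbf{V}_0^{-1} + N\boldsymbol{\Sigma}^{-1} \tag{4.173} \\
\mathbf{m}_N &= \mathbf{V}_N\left(\boldsymbol{\Sigma}^{-1}(N\bar{\mathbf{x}}) + \mathbf{V}_0^{-1}\mathbf{m}_0\right) \tag{4.174}
\end{align}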


This is exactly the same process as inferring the location of an object based on noisy radar 
“blips”, except now we are inferring the mean of a distribution based on noisy samples. (To a 
Bayesian, there is no difference between uncertainty about parameters and uncertainty about 
anything else.) 

We can model an uninformative prior by setting $\mathbf{V}_0 = \infty\mathbf{I}$. In this case we have $p(\boldsymbol{\mu}|\mathcal{D}, \boldsymbol{\Sigma}) = \mathcal{N}(\bar{\mathbf{x}}, \frac{1}{N}\boldsymbol{\Sigma})$,
so the posterior mean is equal to the MLE. We also see that the posterior variance
goes down as $1/N$, which is a standard result from frequentist statistics.
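As a quick numerical check of the posterior update for the mean, the sketch below implements the scalar case, where the covariance is a known variance `sigma2`. The function name and the data are made up for illustration; a very large prior variance stands in for the uninformative limit.

```python
def posterior_for_mean_1d(xs, sigma2, m0, v0):
    """Scalar case of the Gaussian posterior for the mean with known variance.

    Prior: mu ~ N(m0, v0). Data: x_i ~ N(mu, sigma2).
    Returns the posterior mean m_N and posterior variance v_N.
    """
    n = len(xs)
    xbar = sum(xs) / n
    v_n = 1.0 / (1.0 / v0 + n / sigma2)          # precisions add
    m_n = v_n * (n * xbar / sigma2 + m0 / v0)    # precision-weighted average
    return m_n, v_n


# A nearly flat prior (huge v0) recovers the MLE and a variance of sigma2 / N.
m_N, v_N = posterior_for_mean_1d([4.0, 5.0, 6.0], sigma2=1.0, m0=0.0, v0=1e12)
```

Here the posterior mean is essentially the sample mean 5 and the posterior variance is essentially $1/3$, matching the $1/N$ behavior described above.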


Posterior distribution of $\boldsymbol{\Sigma}$ *


We now discuss how to compute $p(\boldsymbol{\Sigma}|\mathcal{D}, \boldsymbol{\mu})$. The likelihood has the form

$$
p(\mathcal{D}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) \propto |\boldsymbol{\Sigma}|^{-N/2} \exp\left(-\frac{1}{2}\mathrm{tr}(\mathbf{S}_\mu \boldsymbol{\Sigma}^{-1})\right) \tag{4.175}
$$

The corresponding conjugate prior is known as the inverse Wishart distribution (Section 4.5.1).
Recall that this has the following pdf:

$$
\mathrm{IW}(\boldsymbol{\Sigma}|\mathbf{S}_0^{-1}, \nu_0) \propto |\boldsymbol{\Sigma}|^{-(\nu_0 + D + 1)/2} \exp\left(-\frac{1}{2}\mathrm{tr}(\mathbf{S}_0 \boldsymbol{\Sigma}^{-1})\right) \tag{4.176}
$$

Here $\nu_0 > D - 1$ is the degrees of freedom (dof), and $\mathbf{S}_0$ is a symmetric pd matrix. Comparing
Equations 4.175 and 4.176, we see that $\mathbf{S}_0$ plays the role of the prior scatter matrix, and $N_0 \triangleq \nu_0 + D + 1$ controls the strength
of the prior, and hence plays a role analogous to the sample size $N$.

Figure 4.17 Estimating a covariance matrix in $D = 50$ dimensions using $N \in \{100, 50, 25\}$ samples.
We plot the eigenvalues in descending order for the true covariance matrix (solid black), the MLE (dotted
blue) and the MAP estimate (dashed red), using Equation 4.184 with $\lambda = 0.9$. We also list the condition
number of each matrix in the legend. Based on Figure 1 of (Schaefer and Strimmer 2005). Figure generated
by shrinkcovDemo.


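Multiplying the likelihood and prior we find that the posterior is also inverse Wishart:

\begin{align}
p(\boldsymbol{\Sigma}|\mathcal{D}, \boldsymbol{\mu}) &\propto |\boldsymbol{\Sigma}|^{-N/2} \exp\left(-\frac{1}{2}\mathrm{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{S}_\mu)\right) |\boldsymbol{\Sigma}|^{-(\nu_0 + D + 1)/2} \exp\left(-\frac{1}{2}\mathrm{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{S}_0)\right) \tag{4.177} \\
&= |\boldsymbol{\Sigma}|^{-\frac{N + (\nu_0 + D + 1)}{2}} \exp\left(-\frac{1}{2}\mathrm{tr}\left[\boldsymbol{\Sigma}^{-1}(\mathbf{S}_\mu + \mathbf{S}_0)\right]\right) \tag{4.178} \\
&= \mathrm{IW}(\boldsymbol{\Sigma}|\mathbf{S}_N^{-1}, \nu_N) \tag{4.179} \\
\nu_N &= \nu_0 + N \tag{4.180} \\
\mathbf{S}_N &= \mathbf{S}_0 + \mathbf{S}_\mu \tag{4.181}
\end{align}

In words, this says that the posterior strength $\nu_N$ is the prior strength $\nu_0$ plus the number of
observations $N$, and the posterior scatter matrix $\mathbf{S}_N$ is the prior scatter matrix $\mathbf{S}_0$ plus the data
scatter matrix $\mathbf{S}_\mu$.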
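The update is just two additions, as the following Python sketch shows for a known mean. It uses nested lists instead of a matrix library so that it is self-contained; the helper names and the toy data are assumptions made for illustration.

```python
def scatter_about(xs, mu):
    """Data scatter matrix: sum_i (x_i - mu)(x_i - mu)^T, as a nested list."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for x in xs:
        diff = [xi - mi for xi, mi in zip(x, mu)]
        for r in range(d):
            for c in range(d):
                s[r][c] += diff[r] * diff[c]
    return s


def iw_posterior(xs, mu, s0, nu0):
    """Posterior scatter matrix and dof: add the data scatter and the sample count."""
    d = len(mu)
    s_mu = scatter_about(xs, mu)
    s_n = [[s0[r][c] + s_mu[r][c] for c in range(d)] for r in range(d)]
    return s_n, nu0 + len(xs)


# Toy data with known mean (0, 0) and an identity prior scatter matrix.
xs = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
s_n, nu_n = iw_posterior(xs, mu=[0.0, 0.0], s0=[[1.0, 0.0], [0.0, 1.0]], nu0=2)
```

For this data the data scatter matrix is diag(2, 8), so the posterior scatter matrix is diag(3, 9) and the posterior dof is 6.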


MAP estimation 


We see from Equation 4.7 that $\hat{\boldsymbol{\Sigma}}_{mle}$ is a rank $\min(N, D)$ matrix. If $N < D$, this is not
full rank, and hence will be uninvertible. And even if $N > D$, it may be the case that $\hat{\boldsymbol{\Sigma}}$ is
ill-conditioned (meaning it is nearly singular).

To solve these problems, we can use the posterior mode (or mean). One can show (using
techniques analogous to the derivation of the MLE) that the MAP estimate is given by

$$
\hat{\boldsymbol{\Sigma}}_{map} = \frac{\mathbf{S}_N}{\nu_N + D + 1} = \frac{\mathbf{S}_0 + \mathbf{S}_\mu}{N_0 + N} \tag{4.182}
$$

If we use an improper uniform prior, corresponding to $N_0 = 0$ and $\mathbf{S}_0 = \mathbf{0}$, we recover the
MLE.

Let us now consider the use of a proper informative prior, which is necessary whenever $D/N$
is large (say bigger than 0.1). Let $\boldsymbol{\mu} = \bar{\mathbf{x}}$, so $\mathbf{S}_\mu = \mathbf{S}_{\bar{x}}$. Then we can rewrite the MAP estimate
as a convex combination of the prior mode and the MLE. To see this, let $\boldsymbol{\Sigma}_0 \triangleq \frac{\mathbf{S}_0}{N_0}$ be the prior
mode. Then the posterior mode can be rewritten as

$$
\hat{\boldsymbol{\Sigma}}_{map} = \frac{\mathbf{S}_0 + \mathbf{S}_{\bar{x}}}{N_0 + N} = \frac{N_0}{N_0 + N}\frac{\mathbf{S}_0}{N_0} + \frac{N}{N_0 + N}\frac{\mathbf{S}_{\bar{x}}}{N} = \lambda\boldsymbol{\Sigma}_0 + (1 - \lambda)\hat{\boldsymbol{\Sigma}}_{mle} \tag{4.183}
$$

where $\lambda = \frac{N_0}{N_0 + N}$ controls the amount of shrinkage towards the prior.

This raises the question: where do the parameters of the prior come from? It is common to
set $\lambda$ by cross validation. Alternatively, we can use the closed-form formula provided in (Ledoit
and Wolf 2004b,a; Schaefer and Strimmer 2005), which is the optimal frequentist estimate if we
use squared loss. This is arguably not the most natural loss function for covariance matrices
(because it ignores the positive definite constraint), but it results in a simple estimator, which
is implemented in the PMTK function shrinkcov. We discuss Bayesian ways of estimating A 
later. 

As for the prior covariance matrix, $\mathbf{S}_0$, it is common to use the following (data dependent)
prior: $\mathbf{S}_0 = \mathrm{diag}(\hat{\boldsymbol{\Sigma}}_{mle})$. In this case, the MAP estimate is given by

$$
\hat{\Sigma}_{map}(i, j) =
\begin{cases}
\hat{\Sigma}_{mle}(i, j) & \text{if } i = j \\
(1 - \lambda)\hat{\Sigma}_{mle}(i, j) & \text{otherwise}
\end{cases}
\tag{4.184}
$$

Thus we see that the diagonal entries are equal to their ML estimates, and the off-diagonal
elements are “shrunk” somewhat towards 0. This technique is therefore called shrinkage
estimation, or regularized estimation.
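The shrinkage rule is simple enough to state in code. The sketch below keeps the diagonal of the MLE and scales the off-diagonal entries by $1 - \lambda$, then compares condition numbers for a nearly singular $2 \times 2$ example. It is not the PMTK shrinkcov function; the names, the closed-form eigenvalue helper and the example matrix are illustrative assumptions.

```python
import math


def shrink_covariance(sigma_mle, lam):
    """Keep the diagonal of the MLE and scale off-diagonal entries by (1 - lam)."""
    d = len(sigma_mle)
    return [[sigma_mle[i][j] if i == j else (1.0 - lam) * sigma_mle[i][j]
             for j in range(d)] for i in range(d)]


def condition_number_2x2(m):
    """Ratio of largest to smallest eigenvalue of a symmetric pd 2x2 matrix."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    center = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return (center + radius) / (center - radius)


# A nearly singular MLE: two unit-variance variables with correlation 0.99.
sigma_mle = [[1.0, 0.99], [0.99, 1.0]]
sigma_map = shrink_covariance(sigma_mle, lam=0.9)
cond_mle = condition_number_2x2(sigma_mle)
cond_map = condition_number_2x2(sigma_map)
```

The MLE has a condition number of roughly 199, while the shrunk estimate, whose off-diagonal entry is about 0.099, has a condition number close to 1.2.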

The benefits of MAP estimation are illustrated in Figure 4.17. We consider fitting a 50 dimensional
Gaussian to $N = 100$, $N = 50$ and $N = 25$ data points. We see that the MAP estimate
is always well-conditioned, unlike the MLE. In particular, we see that the eigenvalue spectrum
of the MAP estimate is much closer to that of the true matrix than the MLE’s. The eigenvectors,
however, are unaffected.

The importance of regularizing the estimate of $\boldsymbol{\Sigma}$ will become apparent in later chapters,
when we consider fitting covariance matrices to high dimensional data.


Univariate posterior 


In the 1d case, the likelihood has the form

$$
p(\mathcal{D}|\sigma^2) \propto (\sigma^2)^{-N/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2\right) \tag{4.185}
$$

The standard conjugate prior is the inverse Gamma distribution, which is just the scalar version
of the inverse Wishart:

$$
\mathrm{IG}(\sigma^2|a_0, b_0) \propto (\sigma^2)^{-(a_0 + 1)} \exp\left(-\frac{b_0}{\sigma^2}\right) \tag{4.186}
$$

Figure 4.18 Sequential updating of the posterior for $\sigma^2$ starting from an uninformative prior. The data
was generated from a Gaussian with known mean $\mu = 5$ and unknown variance $\sigma^2 = 10$. Figure generated
by gaussSeqUpdateSigma1D.

Multiplying the likelihood and the prior, we see that the posterior is also IG:

\begin{align}
p(\sigma^2|\mathcal{D}) &= \mathrm{IG}(\sigma^2|a_N, b_N) \tag{4.187} \\
a_N &= a_0 + N/2 \tag{4.188} \\
b_N &= b_0 + \frac{1}{2}\sum_{i=1}^{N}(x_i - \mu)^2 \tag{4.189}
\end{align}

See Figure 4.18 for an illustration. 
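Sequential updating works because the posterior after each observation serves as the prior for the next one. The Python sketch below applies the inverse Gamma update once to a whole batch and then one observation at a time, and the two routes give the same hyper-parameters. The data values and the function name are made up for illustration.

```python
def ig_update(a, b, xs, mu):
    """Inverse Gamma update for sigma^2 with known mean mu.

    Shape grows by N/2; scale grows by half the sum of squared deviations.
    """
    a_new = a + len(xs) / 2.0
    b_new = b + 0.5 * sum((x - mu) ** 2 for x in xs)
    return a_new, b_new


data = [4.0, 7.0, 5.0, 2.0]

# Batch update: all observations at once.
a_batch, b_batch = ig_update(1.0, 1.0, data, mu=5.0)

# Sequential update: yesterday's posterior is today's prior.
a_seq, b_seq = 1.0, 1.0
for x in data:
    a_seq, b_seq = ig_update(a_seq, b_seq, [x], mu=5.0)
```

Both routes end at $a_N = 3$ and $b_N = 8$ for this data.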

The form of the posterior is not quite as pretty as the multivariate case, because of the
factors of $\frac{1}{2}$. This arises because $\mathrm{IW}(\sigma^2|s_0^{-1}, \nu_0) = \mathrm{IG}(\sigma^2|\frac{\nu_0}{2}, \frac{s_0}{2})$. Another problem with using
the $\mathrm{IG}(a_0, b_0)$ distribution is that the strength of the prior is encoded in both $a_0$ and $b_0$.
To avoid both of these problems, it is common (in the statistics literature) to use an alternative
parameterization of the IG distribution, known as the (scaled) inverse chi-squared distribution.
This is defined as follows:

$$
\chi^{-2}(\sigma^2|\nu_0, \sigma_0^2) = \mathrm{IG}\left(\sigma^2 \,\middle|\, \frac{\nu_0}{2}, \frac{\nu_0\sigma_0^2}{2}\right) \propto (\sigma^2)^{-\nu_0/2 - 1} \exp\left(-\frac{\nu_0\sigma_0^2}{2\sigma^2}\right) \tag{4.190}
$$

Here $\nu_0$ controls the strength of the prior, and $\sigma_0^2$ encodes the value of the prior. With this
prior, the posterior becomes

\begin{align}
p(\sigma^2|\mathcal{D}, \mu) &= \chi^{-2}(\sigma^2|\nu_N, \sigma_N^2) \tag{4.191} \\
\nu_N &= \nu_0 + N \tag{4.192} \\
\sigma_N^2 &= \frac{\nu_0\sigma_0^2 + \sum_{i=1}^{N}(x_i - \mu)^2}{\nu_N} \tag{4.193}
\end{align}

We see that the posterior dof $\nu_N$ is the prior dof $\nu_0$ plus $N$, and the posterior sum of squares
$\nu_N\sigma_N^2$ is the prior sum of squares $\nu_0\sigma_0^2$ plus the data sum of squares.
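In this parameterization the update reads as a pooled sum of squares, which the short sketch below makes explicit. The function name and the data are hypothetical; the known mean is passed in as `mu`.

```python
def inv_chi2_update(nu0, sigma2_0, xs, mu):
    """Scaled inverse chi-squared update for sigma^2 with known mean mu.

    The posterior dof adds N; the posterior sum of squares adds the data sum of squares.
    """
    nu_n = nu0 + len(xs)
    data_ss = sum((x - mu) ** 2 for x in xs)
    sigma2_n = (nu0 * sigma2_0 + data_ss) / nu_n
    return nu_n, sigma2_n


# Prior worth two virtual observations with variance 1, then four real observations.
nu_n, sigma2_n = inv_chi2_update(nu0=2, sigma2_0=1.0, xs=[4.0, 7.0, 5.0, 2.0], mu=5.0)
```

The prior contributes a sum of squares of 2 and the data contribute 14, so $\nu_N = 6$ and $\sigma_N^2 = 16/6$. In inverse Gamma terms this is shape $\nu_N/2 = 3$ and scale $\nu_N\sigma_N^2/2 = 8$.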

We can emulate an uninformative prior, $p(\sigma^2) \propto \sigma^{-2}$, by setting $\nu_0 = 0$, which makes
intuitive sense (since it corresponds to a zero virtual sample size).

Posterior distribution of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ *


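We now discuss how to compute $p(\boldsymbol{\mu}, \boldsymbol{\Sigma}|\mathcal{D})$. These results are a bit complex, but will prove
useful later on in this book. Feel free to skip this section on a first reading.


Likelihood

The likelihood is given by

$$
p(\mathcal{D}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-ND/2} |\boldsymbol{\Sigma}|^{-N/2} \exp\left(-\frac{1}{2}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu})\right) \tag{4.194}
$$

Now one can show that

$$
\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) = \mathrm{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{S}_{\bar{x}}) + N(\bar{\mathbf{x}} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\bar{\mathbf{x}} - \boldsymbol{\mu}) \tag{4.195}
$$

Hence we can rewrite the likelihood as follows:

\begin{align}
p(\mathcal{D}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) &= (2\pi)^{-ND/2} |\boldsymbol{\Sigma}|^{-N/2} \exp\left(-\frac{N}{2}(\boldsymbol{\mu} - \bar{\mathbf{x}})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \bar{\mathbf{x}})\right) \tag{4.196} \\
&\quad \times \exp\left(-\frac{1}{2}\mathrm{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{S}_{\bar{x}})\right) \tag{4.197}
\end{align}

We will use this form below.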
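Prior

The obvious prior to use is the following

$$
p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \mathcal{N}(\boldsymbol{\mu}|\mathbf{m}_0, \mathbf{V}_0)\, \mathrm{IW}(\boldsymbol{\Sigma}|\mathbf{S}_0, \nu_0) \tag{4.198}
$$

Unfortunately, this is not conjugate to the likelihood. To see why, note that $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ appear
together in a non-factorized way in the likelihood; hence they will also be coupled together in
the posterior.

The above prior is sometimes called semi-conjugate or conditionally conjugate, since both
conditionals, $p(\boldsymbol{\mu}|\boldsymbol{\Sigma})$ and $p(\boldsymbol{\Sigma}|\boldsymbol{\mu})$, are individually conjugate. To create a full conjugate prior,
we need to use a prior where $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are dependent on each other. We will use a joint
distribution of the form

$$
p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = p(\boldsymbol{\Sigma})\, p(\boldsymbol{\mu}|\boldsymbol{\Sigma}) \tag{4.199}
$$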


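Looking at the form of the likelihood equation, Equation 4.197, we see that a natural conjugate
prior has the form of a Normal-inverse-Wishart or NIW distribution, defined as follows:

\begin{align}
\mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma}|\mathbf{m}_0, \kappa_0, \nu_0, \mathbf{S}_0) &\triangleq \tag{4.200} \\
&\mathcal{N}\left(\boldsymbol{\mu} \,\middle|\, \mathbf{m}_0, \frac{1}{\kappa_0}\boldsymbol{\Sigma}\right) \times \mathrm{IW}(\boldsymbol{\Sigma}|\mathbf{S}_0, \nu_0) \tag{4.201} \\
&= \frac{1}{Z_{\mathrm{NIW}}} |\boldsymbol{\Sigma}|^{-\frac{1}{2}} \exp\left(-\frac{\kappa_0}{2}(\boldsymbol{\mu} - \mathbf{m}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \mathbf{m}_0)\right) \tag{4.202} \\
&\quad \times |\boldsymbol{\Sigma}|^{-\frac{\nu_0 + D + 1}{2}} \exp\left(-\frac{1}{2}\mathrm{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{S}_0)\right) \tag{4.203} \\
&= \frac{1}{Z_{\mathrm{NIW}}} |\boldsymbol{\Sigma}|^{-\frac{\nu_0 + D + 2}{2}} \tag{4.204} \\
&\quad \times \exp\left(-\frac{\kappa_0}{2}(\boldsymbol{\mu} - \mathbf{m}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \mathbf{m}_0) - \frac{1}{2}\mathrm{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{S}_0)\right) \tag{4.205} \\
Z_{\mathrm{NIW}} &= 2^{\nu_0 D/2}\, \Gamma_D(\nu_0/2)\, (2\pi/\kappa_0)^{D/2}\, |\mathbf{S}_0|^{-\nu_0/2} \tag{4.206}
\end{align}

where $\Gamma_D(a)$ is the multivariate Gamma function.

The parameters of the NIW can be interpreted as follows: $\mathbf{m}_0$ is our prior mean for $\boldsymbol{\mu}$, and
$\kappa_0$ is how strongly we believe this prior; and $\mathbf{S}_0$ is (proportional to) our prior mean for $\boldsymbol{\Sigma}$, and
$\nu_0$ is how strongly we believe this prior.³

One can show (Minka 2000f) that the (improper) uninformative prior has the form

\begin{align}
\lim_{k \to 0} \mathcal{N}(\boldsymbol{\mu}|\mathbf{m}_0, \boldsymbol{\Sigma}/k)\, \mathrm{IW}(\boldsymbol{\Sigma}|\mathbf{S}_0, k) &\propto |2\pi\boldsymbol{\Sigma}|^{-\frac{1}{2}} |\boldsymbol{\Sigma}|^{-(D + 1)/2} \tag{4.207} \\
&\propto |\boldsymbol{\Sigma}|^{-(\frac{D}{2} + 1)} \propto \mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma}|\mathbf{0}, 0, 0, 0\mathbf{I}) \tag{4.208}
\end{align}

In practice, it is often better to use a weakly informative data-dependent prior. A common
choice (see e.g., (Chipman et al. 2001, p81), (Fraley and Raftery 2007, p6)) is to use $\mathbf{S}_0 =
\mathrm{diag}(\mathbf{S}_{\bar{x}})/N$, and $\nu_0 = D + 2$, to ensure $\mathbb{E}[\boldsymbol{\Sigma}] = \mathbf{S}_0$, and to set $\mathbf{m}_0 = \bar{\mathbf{x}}$ and $\kappa_0$ to some small
number, such as 0.01.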


3. Although this prior has four parameters, there are really only three free parameters, since our uncertainty in the
mean is proportional to the variance. In particular, if we believe that the variance is large, then our uncertainty in $\boldsymbol{\mu}$
must be large too. This makes sense intuitively, since if the data has large spread, it may be hard to pin down its mean.
See also Exercise 9.1, where we will see the three free parameters more explicitly. If we want separate “control” over our
confidence in $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, we must use a semi-conjugate prior.

Posterior


The posterior can be shown (Exercise 4.11) to be NIW with updated parameters:

\begin{align}
p(\boldsymbol{\mu}, \boldsymbol{\Sigma}|\mathcal{D}) &= \mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma}|\mathbf{m}_N, \kappa_N, \nu_N, \mathbf{S}_N) \tag{4.209} \\
\mathbf{m}_N &= \frac{\kappa_0\mathbf{m}_0 + N\bar{\mathbf{x}}}{\kappa_N} = \frac{\kappa_0}{\kappa_0 + N}\mathbf{m}_0 + \frac{N}{\kappa_0 + N}\bar{\mathbf{x}} \tag{4.210} \\
\kappa_N &= \kappa_0 + N \tag{4.211} \\
\nu_N &= \nu_0 + N \tag{4.212} \\
\mathbf{S}_N &= \mathbf{S}_0 + \mathbf{S}_{\bar{x}} + \frac{\kappa_0 N}{\kappa_0 + N}(\bar{\mathbf{x}} - \mathbf{m}_0)(\bar{\mathbf{x}} - \mathbf{m}_0)^T \tag{4.213} \\
&= \mathbf{S}_0 + \mathbf{S} + \kappa_0\mathbf{m}_0\mathbf{m}_0^T - \kappa_N\mathbf{m}_N\mathbf{m}_N^T \tag{4.214}
\end{align}

where we have defined $\mathbf{S} \triangleq \sum_{i=1}^{N} \mathbf{x}_i\mathbf{x}_i^T$ as the uncentered sum-of-squares matrix (this is easier
to update incrementally than the centered version).

This result is actually quite intuitive: the posterior mean is a convex combination of the prior
mean and the MLE, with “strength” $\kappa_0 + N$; and the posterior scatter matrix $\mathbf{S}_N$ is the prior
scatter matrix $\mathbf{S}_0$ plus the empirical scatter matrix $\mathbf{S}_{\bar{x}}$ plus an extra term due to the uncertainty
in the mean (which creates its own virtual scatter matrix).
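The NIW update can be written out directly. The Python sketch below computes the posterior parameters with the centered form of the scatter update, and a second helper computes the same matrix from the uncentered sum-of-squares form, so that the two expressions can be compared numerically. It uses nested lists rather than a matrix library; all names and the toy data are assumptions made for illustration.

```python
def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]


def add(*mats):
    """Elementwise sum of equally sized matrices."""
    d = len(mats[0])
    return [[sum(m[r][c] for m in mats) for c in range(d)] for r in range(d)]


def scale(m, k):
    """Multiply every entry of m by the scalar k."""
    return [[k * v for v in row] for row in m]


def niw_posterior(xs, m0, kappa0, nu0, s0):
    """NIW posterior parameters, using the centered scatter matrix."""
    n, d = len(xs), len(m0)
    xbar = [sum(x[j] for x in xs) / n for j in range(d)]
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    m_n = [(kappa0 * m0[j] + n * xbar[j]) / kappa_n for j in range(d)]
    centered = [[xj - bj for xj, bj in zip(x, xbar)] for x in xs]
    s_xbar = add(*[outer(c, c) for c in centered])
    diff = [bj - mj for bj, mj in zip(xbar, m0)]
    s_n = add(s0, s_xbar, scale(outer(diff, diff), kappa0 * n / kappa_n))
    return m_n, kappa_n, nu_n, s_n


def niw_scatter_uncentered(xs, m0, kappa0, s0, m_n, kappa_n):
    """Same posterior scatter matrix, from the uncentered sum of squares."""
    s = add(*[outer(x, x) for x in xs])
    return add(s0, s, scale(outer(m0, m0), kappa0), scale(outer(m_n, m_n), -kappa_n))


xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
m0, kappa0, nu0, s0 = [0.0, 0.0], 1.0, 4, [[1.0, 0.0], [0.0, 1.0]]
m_n, kappa_n, nu_n, s_n = niw_posterior(xs, m0, kappa0, nu0, s0)
s_n_alt = niw_scatter_uncentered(xs, m0, kappa0, s0, m_n, kappa_n)
```

For this data the posterior mean is (2.25, 3.75), pulled from the sample mean (3, 5) towards the prior mean, and both forms give the same posterior scatter matrix.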


Posterior mode 


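The mode of the joint distribution has the following form:

$$
\operatorname{argmax}\, p(\boldsymbol{\mu}, \boldsymbol{\Sigma}|\mathcal{D}) = \left(\mathbf{m}_N, \frac{\mathbf{S}_N}{\nu_N + D + 2}\right) \tag{4.215}
$$

If we set $\kappa_0 = 0$, this reduces to

$$
\operatorname{argmax}\, p(\boldsymbol{\mu}, \boldsymbol{\Sigma}|\mathcal{D}) = \left(\bar{\mathbf{x}}, \frac{\mathbf{S}_0 + \mathbf{S}_{\bar{x}}}{\nu_0 + N + D + 2}\right) \tag{4.216}
$$

The corresponding estimate $\hat{\boldsymbol{\Sigma}}$ is almost the same as Equation 4.183, but differs by 1 in the
denominator, because this is the mode of the joint, not the mode of the marginal.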


Posterior marginals 


The posterior marginal for $\boldsymbol{\Sigma}$ is simply

$$
p(\boldsymbol{\Sigma}|\mathcal{D}) = \int p(\boldsymbol{\mu}, \boldsymbol{\Sigma}|\mathcal{D})\, d\boldsymbol{\mu} = \mathrm{IW}(\boldsymbol{\Sigma}|\mathbf{S}_N, \nu_N) \tag{4.217}
$$

The mode and mean of this marginal are given by

$$
\hat{\boldsymbol{\Sigma}}_{map} = \frac{\mathbf{S}_N}{\nu_N + D + 1}, \quad \mathbb{E}[\boldsymbol{\Sigma}] = \frac{\mathbf{S}_N}{\nu_N - D - 1} \tag{4.218}
$$

One can show that the posterior marginal for $\boldsymbol{\mu}$ has a multivariate Student T distribution:

$$
p(\boldsymbol{\mu}|\mathcal{D}) = \int p(\boldsymbol{\mu}, \boldsymbol{\Sigma}|\mathcal{D})\, d\boldsymbol{\Sigma} = \mathcal{T}\left(\boldsymbol{\mu} \,\middle|\, \mathbf{m}_N, \frac{1}{\kappa_N(\nu_N - D + 1)}\mathbf{S}_N, \nu_N - D + 1\right) \tag{4.219}
$$

This follows from the fact that the Student distribution can be represented as a scaled mixture
of Gaussians (see Equation 11.61).

Figure 4.19 The $NI\chi^2(m_0, \kappa_0, \nu_0, \sigma_0^2)$ distribution. $m_0$ is the prior mean and $\kappa_0$ is how strongly we
believe this; $\sigma_0^2$ is the prior variance and $\nu_0$ is how strongly we believe this. (a) $m_0 = 0, \kappa_0 = 1, \nu_0 =
1, \sigma_0^2 = 1$. Notice that the contour plot (underneath the surface) is shaped like a “squashed egg”. (b) We
increase the strength of our belief in the mean, so it gets narrower: $m_0 = 0, \kappa_0 = 5, \nu_0 = 1, \sigma_0^2 = 1$. (c)
We increase the strength of our belief in the variance, so it gets narrower: $m_0 = 0, \kappa_0 = 1, \nu_0 = 5, \sigma_0^2 =
1$. Figure generated by NIXdemo2.


Posterior predictive 


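The posterior predictive is given by

$$
p(\mathbf{x}|\mathcal{D}) = \frac{p(\mathbf{x}, \mathcal{D})}{p(\mathcal{D})} \tag{4.220}
$$

so it can be easily evaluated in terms of a ratio of marginal likelihoods.
It turns out that this ratio has the form of a multivariate Student-T distribution:

\begin{align}
p(\mathbf{x}|\mathcal{D}) &= \int\!\!\int \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})\, \mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma}|\mathbf{m}_N, \kappa_N, \nu_N, \mathbf{S}_N)\, d\boldsymbol{\mu}\, d\boldsymbol{\Sigma} \tag{4.221} \\
&= \mathcal{T}\left(\mathbf{x} \,\middle|\, \mathbf{m}_N, \frac{\kappa_N + 1}{\kappa_N(\nu_N - D + 1)}\mathbf{S}_N, \nu_N - D + 1\right) \tag{4.222}
\end{align}

The Student-T has wider tails than a Gaussian, which takes into account the fact that $\boldsymbol{\Sigma}$ is
unknown. However, this rapidly becomes Gaussian-like.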

Posterior for scalar data 


We now specialise the above results to the case where $x_i$ is 1d. These results are widely used
in the statistics literature. As in Section 4.6.2.2, it is conventional not to use the normal inverse
Wishart, but to use the normal inverse chi-squared or NIX distribution, defined by

\begin{align}
NI\chi^2(\mu, \sigma^2|m_0, \kappa_0, \nu_0, \sigma_0^2) &\triangleq \mathcal{N}(\mu|m_0, \sigma^2/\kappa_0)\, \chi^{-2}(\sigma^2|\nu_0, \sigma_0^2) \tag{4.223} \\
&\propto \left(\frac{1}{\sigma^2}\right)^{(\nu_0 + 3)/2} \exp\left(-\frac{\nu_0\sigma_0^2 + \kappa_0(\mu - m_0)^2}{2\sigma^2}\right) \tag{4.224}
\end{align}

See Figure 4.19 for some plots. Along the $\mu$ axis, the distribution is shaped like a Gaussian, and
along the $\sigma^2$ axis, the distribution is shaped like a $\chi^{-2}$; the contours of the joint density have
a “squashed egg” appearance. Interestingly, we see that the contours for $\mu$ are more peaked
for small values of $\sigma^2$, which makes sense, since if the data is low variance, we will be able to
estimate its mean more reliably.

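One can show that the posterior is given by

\begin{align}
p(\mu, \sigma^2|\mathcal{D}) &= NI\chi^2(\mu, \sigma^2|m_N, \kappa_N, \nu_N, \sigma_N^2) \tag{4.225} \\
m_N &= \frac{\kappa_0 m_0 + N\bar{x}}{\kappa_N} \tag{4.226} \\
\kappa_N &= \kappa_0 + N \tag{4.227} \\
\nu_N &= \nu_0 + N \tag{4.228} \\
\nu_N\sigma_N^2 &= \nu_0\sigma_0^2 + \sum_{i=1}^{N}(x_i - \bar{x})^2 + \frac{N\kappa_0}{\kappa_0 + N}(m_0 - \bar{x})^2 \tag{4.229}
\end{align}

The posterior marginal for $\sigma^2$ is just

$$
p(\sigma^2|\mathcal{D}) = \int p(\mu, \sigma^2|\mathcal{D})\, d\mu = \chi^{-2}(\sigma^2|\nu_N, \sigma_N^2) \tag{4.230}
$$

with the posterior mean given by $\mathbb{E}[\sigma^2|\mathcal{D}] = \frac{\nu_N}{\nu_N - 2}\sigma_N^2$.

The posterior marginal for $\mu$ has a Student T distribution, which follows from the scale
mixture representation of the Student:

$$
p(\mu|\mathcal{D}) = \int p(\mu, \sigma^2|\mathcal{D})\, d\sigma^2 = \mathcal{T}(\mu|m_N, \sigma_N^2/\kappa_N, \nu_N) \tag{4.231}
$$

with the posterior mean given by $\mathbb{E}[\mu|\mathcal{D}] = m_N$.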
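The scalar update is easy to compute by hand or in code. The following Python sketch returns the four posterior hyper-parameters of the NIX model; the function name and the example data are illustrative assumptions.

```python
def nix_posterior(xs, m0, kappa0, nu0, sigma2_0):
    """Posterior hyper-parameters of the normal inverse chi-squared model."""
    n = len(xs)
    xbar = sum(xs) / n
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    m_n = (kappa0 * m0 + n * xbar) / kappa_n
    data_ss = sum((x - xbar) ** 2 for x in xs)
    # Prior sum of squares + data sum of squares + disagreement between prior mean and data mean
    total_ss = nu0 * sigma2_0 + data_ss + n * kappa0 / (kappa0 + n) * (m0 - xbar) ** 2
    return m_n, kappa_n, nu_n, total_ss / nu_n


m_n, kappa_n, nu_n, sigma2_n = nix_posterior([2.0, 4.0, 6.0], m0=0.0, kappa0=1.0,
                                             nu0=1.0, sigma2_0=1.0)
post_mean_sigma2 = nu_n / (nu_n - 2.0) * sigma2_n   # posterior mean of sigma^2
mu_scale = sigma2_n / kappa_n                       # squared scale of the Student marginal for mu
```

With this data, $m_N = 3$, $\kappa_N = \nu_N = 4$ and $\sigma_N^2 = 5.25$. The large value of $\sigma_N^2$ reflects the disagreement between the prior mean 0 and the sample mean 4, which contributes 12 of the 21 units in the posterior sum of squares.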
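Let us see how these results look if we use the following uninformative prior:

$$
p(\mu, \sigma^2) \propto p(\mu)\, p(\sigma^2) \propto \sigma^{-2} \propto NI\chi^2(\mu, \sigma^2|m_0 = 0, \kappa_0 = 0, \nu_0 = -1, \sigma_0^2 = 0) \tag{4.232}
$$

With this prior, the posterior has the form

$$
p(\mu, \sigma^2|\mathcal{D}) = NI\chi^2(\mu, \sigma^2|m_N = \bar{x}, \kappa_N = N, \nu_N = N - 1, \sigma_N^2 = s^2) \tag{4.233}
$$

where

$$
s^2 \triangleq \frac{1}{N - 1}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \frac{N}{N - 1}\hat{\sigma}^2_{mle} \tag{4.234}
$$

is the sample variance, so $s$ is the sample standard deviation. (In Section 6.4.2, we show that $s^2$ is an unbiased
estimate of the variance.) Hence the marginal posterior for the mean is given by

$$
p(\mu|\mathcal{D}) = \mathcal{T}\left(\mu \,\middle|\, \bar{x}, \frac{s^2}{N}, N - 1\right) \tag{4.235}
$$

and the posterior variance of $\mu$ is

$$
\mathrm{var}[\mu|\mathcal{D}] = \frac{\nu_N}{\nu_N - 2}\frac{\sigma_N^2}{\kappa_N} = \frac{N - 1}{N - 3}\frac{s^2}{N} \rightarrow \frac{s^2}{N} \tag{4.236}
$$

The square root of this is called the standard error of the mean:

$$
\sqrt{\mathrm{var}[\mu|\mathcal{D}]} \approx \frac{s}{\sqrt{N}} \tag{4.237}
$$

Thus an approximate 95% posterior credible interval for the mean is

$$
I_{.95}(\mu|\mathcal{D}) = \bar{x} \pm 2\frac{s}{\sqrt{N}} \tag{4.238}
$$

(Bayesian credible intervals are discussed in more detail in Section 5.2.2; they are contrasted
with frequentist confidence intervals in Section 6.6.1.)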
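As a small worked example, the sketch below computes the sample mean, the standard error of the mean, and the approximate 95% interval of the mean plus or minus two standard errors. The function name and the measurements are hypothetical.

```python
import math


def approx_credible_interval_95(xs):
    """Sample mean +/- 2 standard errors, using the unbiased sample variance s^2."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    std_err = math.sqrt(s2 / n)
    return xbar - 2.0 * std_err, xbar + 2.0 * std_err


lower, upper = approx_credible_interval_95([4.8, 5.1, 5.3, 4.9, 5.4])
```

For these five measurements the sample mean is 5.1 and the standard error is about 0.114, so the interval is roughly (4.87, 5.33). With so few observations the Student posterior has noticeably heavier tails than a Gaussian, so the factor of 2 is only a rough guide.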


Bayesian t-test 


Suppose we want to test the hypothesis that $\mu \neq \mu_0$ for some known value $\mu_0$ (often 0), given
values $x_i \sim \mathcal{N}(\mu, \sigma^2)$. This is called a two-sided, one-sample t-test. A simple way to perform
such a test is just to check if $\mu_0 \in I_{0.95}(\mu|\mathcal{D})$. If it is not, then we can be 95% sure that
$\mu \neq \mu_0$.⁴ A more common scenario is when we want to test if two paired samples have
the same mean. More precisely, suppose $y_i \sim \mathcal{N}(\mu_1, \sigma^2)$ and $z_i \sim \mathcal{N}(\mu_2, \sigma^2)$. We want to
determine if $\mu = \mu_1 - \mu_2 > 0$, using $x_i = y_i - z_i$ as our data. We can evaluate this quantity
as follows:

$$
p(\mu > \mu_0|\mathcal{D}) = \int_{\mu_0}^{\infty} p(\mu|\mathcal{D})\, d\mu \tag{4.239}
$$

This is called a one-sided, paired t-test. (For a similar approach to unpaired tests, comparing
the difference in binomial proportions, see Section 5.2.3.)

To calculate the posterior, we must specify a prior. Suppose we use an uninformative prior.
As we showed above, we find that the posterior marginal on $\mu$ has the form

$$
p(\mu|\mathcal{D}) = \mathcal{T}\left(\mu \,\middle|\, \bar{x}, \frac{s^2}{N}, N - 1\right) \tag{4.240}
$$

Now let us define the following t statistic:

$$
t \triangleq \frac{\bar{x} - \mu_0}{s/\sqrt{N}} \tag{4.241}
$$

where the denominator is the standard error of the mean. Standardizing $\mu$ in Equation 4.239, we see that

$$
p(\mu > \mu_0|\mathcal{D}) = 1 - F_{N-1}(-t) = F_{N-1}(t) \tag{4.242}
$$

where $F_\nu(t)$ is the cdf of the standard Student t distribution $\mathcal{T}(0, 1, \nu)$.
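The one-sided test can be sketched with the standard library alone. Since the posterior for $\mu$ is a Student distribution centered on the sample mean, the posterior probability that $\mu$ exceeds $\mu_0$ is the standard Student cdf with $N - 1$ degrees of freedom evaluated at $t = (\bar{x} - \mu_0)/(s/\sqrt{N})$. The code below is not bayesTtestDemo: it evaluates the cdf by Simpson's rule, an implementation choice made here to avoid external libraries, and the paired differences are made-up numbers.

```python
import math


def student_t_pdf(t, nu):
    """Density of the standard Student t distribution with nu degrees of freedom."""
    norm = math.gamma((nu + 1) / 2.0) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2.0))
    return norm * (1.0 + t * t / nu) ** (-(nu + 1) / 2.0)


def student_t_cdf(t, nu, steps=2000):
    """CDF via composite Simpson's rule on [0, |t|], using symmetry about 0."""
    a = abs(t)
    if a == 0.0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = student_t_pdf(0.0, nu) + student_t_pdf(a, nu)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * student_t_pdf(k * h, nu)
    area = total * h / 3.0
    return 0.5 + area if t > 0 else 0.5 - area


def prob_mean_exceeds(xs, mu0):
    """Return the t statistic and the posterior probability that mu > mu0."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    t = (xbar - mu0) / math.sqrt(s2 / n)
    return t, student_t_cdf(t, n - 1)


# Hypothetical paired differences x_i = y_i - z_i.
t_stat, prob = prob_mean_exceeds([0.5, 1.2, -0.3, 0.8, 1.0], mu0=0.0)
```

A positive t statistic gives a probability above one half; a value of `prob` close to 1 indicates strong posterior support for $\mu > \mu_0$.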


4. A more complex approach is to perform Bayesian model comparison. That is, we compute the Bayes factor (described
in Section 5.3.3) $p(\mathcal{D}|H_0)/p(\mathcal{D}|H_1)$, where $H_0$ is the point null hypothesis that $\mu = \mu_0$, and $H_1$ is the alternative
hypothesis that $\mu \neq \mu_0$. See (Gonen et al. 2005; Rouder et al. 2009) for details.

4.6.3.9 Connection with frequentist statistics *


If we use an uninformative prior, it turns out that the above Bayesian analysis gives the same
result as derived using frequentist methods. (We discuss frequentist statistics in Chapter 6.)
Specifically, from the above results, we see that

$$
\left.\frac{\mu - \bar{x}}{\sqrt{s^2/N}} \,\right|\, \mathcal{D} \sim t_{N-1} \tag{4.243}
$$

This has the same form as the sampling distribution of the MLE:

$$
\left.\frac{\mu - \bar{X}}{\sqrt{s^2/N}} \,\right|\, \mu \sim t_{N-1} \tag{4.244}
$$

The reason is that the Student distribution is symmetric in its first two arguments, so $\mathcal{T}(\bar{x}|\mu, \sigma^2, \nu) =
\mathcal{T}(\mu|\bar{x}, \sigma^2, \nu)$; hence statements about the posterior for $\mu$ have the same form as statements
about the sampling distribution of $\bar{x}$. Consequently, the (one-sided) p-value (defined in Section 6.6.2)
returned by a frequentist test is the same as $p(\mu > \mu_0|\mathcal{D})$ returned by the Bayesian
method. See bayesTtestDemo for an example.

Despite the superficial similarity, these two results have a different interpretation: in the
Bayesian approach, $\mu$ is unknown and $\bar{x}$ is fixed, whereas in the frequentist approach, $\bar{X}$
is unknown and $\mu$ is fixed. More equivalences between frequentist and Bayesian inference
in simple models using uninformative priors can be found in (Box and Tiao 1973). See also
Section 7.6.3.3.


4.6.4 Sensor fusion with unknown precisions * 


In this section, we apply the results in Section 4.6.3 to the problem of sensor fusion in the 
case where the precision of each measurement device is unknown. This generalizes the results 
of Section 4.4.2.2, where the measurement model was assumed to be Gaussian with known 
precision. The unknown precision case turns out to give qualitatively different results, yielding 
a potentially multi-modal posterior as we will see. Our presentation is based on (Minka 2001e).

Suppose we want to pool data from multiple sources to estimate some quantity $\mu \in \mathbb{R}$, but the
reliability of the sources is unknown. Specifically, suppose we have two different measurement
devices, $x$ and $y$, with different precisions: $x_i|\mu \sim \mathcal{N}(\mu, \lambda_x^{-1})$ and $y_i|\mu \sim \mathcal{N}(\mu, \lambda_y^{-1})$. We
make two independent measurements with each device, which turn out to be

$$
x_1 = 1.1, \quad x_2 = 1.9, \quad y_1 = 2.9, \quad y_2 = 4.1 \tag{4.245}
$$

We will use a non-informative prior for $\mu$, $p(\mu) \propto 1$, which we can emulate using an infinitely
broad Gaussian, $p(\mu) = \mathcal{N}(\mu|m_0 = 0, \lambda_0^{-1} = \infty)$. If the $\lambda_x$ and $\lambda_y$ terms were known, then
the posterior would be Gaussian:

\begin{align}
p(\mu|\mathcal{D}, \lambda_x, \lambda_y) &= \mathcal{N}(\mu|m_N, \lambda_N^{-1}) \tag{4.246} \\
\lambda_N &= \lambda_0 + N_x\lambda_x + N_y\lambda_y \tag{4.247} \\
m_N &= \frac{\lambda_x N_x \bar{x} + \lambda_y N_y \bar{y}}{N_x\lambda_x + N_y\lambda_y} \tag{4.248}
\end{align}

where $N_x = 2$ is the number of $x$ measurements, $N_y = 2$ is the number of $y$ measurements,
$\bar{x} = \frac{1}{N_x}\sum_{i=1}^{N_x} x_i = 1.5$ and $\bar{y} = \frac{1}{N_y}\sum_{i=1}^{N_y} y_i = 3.5$. This result follows because the posterior
precision is the sum of the measurement precisions, and the posterior mean is a weighted sum
of the prior mean (which is 0) and the data means.

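However, the measurement precisions are not known. Initially we will estimate them by
maximum likelihood. The log-likelihood is given by

$$
\ell(\mu, \lambda_x, \lambda_y) = \log\lambda_x - \frac{\lambda_x}{2}\sum_i (x_i - \mu)^2 + \log\lambda_y - \frac{\lambda_y}{2}\sum_i (y_i - \mu)^2 \tag{4.249}
$$

The MLE is obtained by solving the following simultaneous equations:

\begin{align}
\frac{\partial\ell}{\partial\mu} &= \lambda_x N_x(\bar{x} - \mu) + \lambda_y N_y(\bar{y} - \mu) = 0 \tag{4.250} \\
\frac{\partial\ell}{\partial\lambda_x} &= \frac{1}{\lambda_x} - \frac{1}{N_x}\sum_{i=1}^{N_x}(x_i - \mu)^2 = 0 \tag{4.251} \\
\frac{\partial\ell}{\partial\lambda_y} &= \frac{1}{\lambda_y} - \frac{1}{N_y}\sum_{i=1}^{N_y}(y_i - \mu)^2 = 0 \tag{4.252}
\end{align}

This gives

\begin{align}
\hat{\mu} &= \frac{N_x\hat{\lambda}_x\bar{x} + N_y\hat{\lambda}_y\bar{y}}{N_x\hat{\lambda}_x + N_y\hat{\lambda}_y} \tag{4.253} \\
1/\hat{\lambda}_x &= \frac{1}{N_x}\sum_i (x_i - \hat{\mu})^2 \tag{4.254} \\
1/\hat{\lambda}_y &= \frac{1}{N_y}\sum_i (y_i - \hat{\mu})^2 \tag{4.255}
\end{align}

We notice that the MLE for $\mu$ has the same form as the posterior mean, $m_N$.

We can solve these equations by fixed point iteration. Let us initialize by estimating $\lambda_x = 1/s_x^2$
and $\lambda_y = 1/s_y^2$, where $s_x^2 = \frac{1}{N_x}\sum_{i=1}^{N_x}(x_i - \bar{x})^2 = 0.16$ and $s_y^2 = \frac{1}{N_y}\sum_{i=1}^{N_y}(y_i - \bar{y})^2 = 0.36$.
Using this, we get $\hat{\mu} = 2.1154$, so $p(\mu|\mathcal{D}, \hat{\lambda}_x, \hat{\lambda}_y) = \mathcal{N}(\mu|2.1154, 0.0554)$. If we now iterate,
we converge to $\hat{\lambda}_x = 1/0.1662$, $\hat{\lambda}_y = 1/4.0509$, $p(\mu|\mathcal{D}, \hat{\lambda}_x, \hat{\lambda}_y) = \mathcal{N}(\mu|1.5788, 0.0798)$.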
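The fixed point iteration is short enough to write out. The Python sketch below alternates between the update for the mean and the updates for the two inverse precisions, starting from the per-sensor variances about each sensor's own mean. It is a plain reimplementation for illustration, not the sensorFusionUnknownPrec script, and the function name and iteration count are assumptions.

```python
def sensor_fusion_mle(xs, ys, iters=100):
    """Fixed point iteration for the MLE of mu and the two sensor precisions.

    Returns a list of (mu_hat, plug-in posterior variance) pairs, one per iteration.
    """
    nx, ny = len(xs), len(ys)
    xbar, ybar = sum(xs) / nx, sum(ys) / ny
    # Initialize 1/lambda with each sensor's variance about its own mean.
    var_x = sum((x - xbar) ** 2 for x in xs) / nx
    var_y = sum((y - ybar) ** 2 for y in ys) / ny
    history = []
    for _ in range(iters):
        lam_x, lam_y = 1.0 / var_x, 1.0 / var_y
        total_precision = nx * lam_x + ny * lam_y
        mu = (nx * lam_x * xbar + ny * lam_y * ybar) / total_precision
        history.append((mu, 1.0 / total_precision))
        # Re-estimate the inverse precisions about the current mu.
        var_x = sum((x - mu) ** 2 for x in xs) / nx
        var_y = sum((y - mu) ** 2 for y in ys) / ny
    return history


history = sensor_fusion_mle([1.1, 1.9], [2.9, 4.1])
first_mu, first_var = history[0]
final_mu, final_var = history[-1]
```

The first iterate reproduces the values 2.1154 and 0.0554 quoted above, and later iterates settle near 1.5788 and 0.0798.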

The plug-in approximation to the posterior is plotted in Figure 4.20(a). This weights each
sensor according to its estimated precision. Since sensor $y$ was estimated to be much less
reliable than sensor $x$, we have $\mathbb{E}[\mu|\mathcal{D}, \hat{\lambda}_x, \hat{\lambda}_y] \approx \bar{x}$, so we effectively ignore the $y$ sensor.


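Now we will adopt a Bayesian approach and integrate out the unknown precisions, rather
than trying to estimate them. That is, we compute

$$
p(\mu|\mathcal{D}) \propto p(\mu)\left[\int p(\mathcal{D}_x|\mu, \lambda_x)\, p(\lambda_x|\mu)\, d\lambda_x\right]\left[\int p(\mathcal{D}_y|\mu, \lambda_y)\, p(\lambda_y|\mu)\, d\lambda_y\right] \tag{4.256}
$$

We will use uninformative Jeffreys priors, $p(\mu) \propto 1$, $p(\lambda_x|\mu) \propto 1/\lambda_x$ and $p(\lambda_y|\mu) \propto 1/\lambda_y$.
Since the $x$ and $y$ terms are symmetric, we will just focus on one of them. The key integral is

\begin{align}
I = \int p(\mathcal{D}_x|\mu, \lambda_x)\, p(\lambda_x|\mu)\, d\lambda_x &\propto \int \lambda_x^{-1}(N_x\lambda_x)^{N_x/2} \tag{4.257} \\
&\quad \exp\left(-\frac{N_x}{2}\lambda_x(\bar{x} - \mu)^2 - \frac{N_x}{2}s_x^2\lambda_x\right) d\lambda_x \tag{4.258}
\end{align}

Exploiting the fact that $N_x = 2$ this simplifies to

$$
I = \int \lambda_x^{-1}\lambda_x^{1} \exp\left(-\lambda_x\left[(\bar{x} - \mu)^2 + s_x^2\right]\right) d\lambda_x \tag{4.259}
$$

We recognize this as proportional to the integral of an unnormalized Gamma density

$$
\mathrm{Ga}(\lambda|a, b) \propto \lambda^{a-1} e^{-\lambda b} \tag{4.260}
$$

where $a = 1$ and $b = (\bar{x} - \mu)^2 + s_x^2$. Hence the integral is proportional to the normalizing
constant of the Gamma distribution, $\Gamma(a)b^{-a}$, so we get

$$
I \propto \int p(\mathcal{D}_x|\mu, \lambda_x)\, p(\lambda_x|\mu)\, d\lambda_x \propto \left((\bar{x} - \mu)^2 + s_x^2\right)^{-1} \tag{4.261}
$$

and the posterior becomes

$$
p(\mu|\mathcal{D}) \propto \frac{1}{(\bar{x} - \mu)^2 + s_x^2} \cdot \frac{1}{(\bar{y} - \mu)^2 + s_y^2} \tag{4.262}
$$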


The exact posterior is plotted in Figure 4.20(b). We see that it has two modes, one near
$\bar{x} = 1.5$ and one near $\bar{y} = 3.5$. These correspond to the beliefs that the $x$ sensor is more
reliable than the y one, and vice versa. The weight of the first mode is larger, since the data 
from the x sensor agree more with each other, so it seems slightly more likely that the x sensor 
is the reliable one. (They obviously cannot both be reliable, since they disagree on the values 
that they are reporting.) However, the Bayesian solution keeps open the possibility that the y 
sensor is the more reliable one; from two measurements, we cannot tell, and choosing just the 
x sensor, as the plug-in approximation does, results in over confidence (a posterior that is too 
narrow). 
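The bimodality can be confirmed numerically by evaluating the unnormalized posterior, the product of the two factors $1/((\bar{x} - \mu)^2 + s_x^2)$ and $1/((\bar{y} - \mu)^2 + s_y^2)$, on a grid and locating its local maxima. The sketch below does this in plain Python; the grid range, resolution and function names are arbitrary choices, and the closed form it evaluates holds only for two measurements per sensor.

```python
def unnormalized_posterior(mu, xs, ys):
    """Product of the two marginalized sensor terms (two measurements per sensor)."""
    xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
    s2x = sum((x - xbar) ** 2 for x in xs) / len(xs)
    s2y = sum((y - ybar) ** 2 for y in ys) / len(ys)
    return 1.0 / (((xbar - mu) ** 2 + s2x) * ((ybar - mu) ** 2 + s2y))


def find_modes(xs, ys, lo=0.0, hi=5.0, steps=5000):
    """Return (location, height) for every strict local maximum on a uniform grid."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    dens = [unnormalized_posterior(m, xs, ys) for m in grid]
    return [(grid[k], dens[k]) for k in range(1, steps)
            if dens[k] > dens[k - 1] and dens[k] > dens[k + 1]]


modes = find_modes([1.1, 1.9], [2.9, 4.1])
```

The grid search finds two modes, a taller one close to the $x$ sensor's mean and a shorter one pulled slightly below the $y$ sensor's mean, in line with the discussion above.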


Exercises 


Exercise 4.1 Uncorrelated does not imply independent 

Let $X \sim U(-1, 1)$ and $Y = X^2$. Clearly $Y$ is dependent on $X$ (in fact, $Y$ is uniquely determined
by $X$). However, show that $\rho(X, Y) = 0$. Hint: if $X \sim U(a, b)$ then $\mathbb{E}[X] = (a + b)/2$ and
$\mathrm{var}[X] = (b - a)^2/12$.


Exercise 4.2 Uncorrelated and Gaussian does not imply independent unless jointly Gaussian 


Let $X \sim \mathcal{N}(0, 1)$ and $Y = WX$, where $p(W = -1) = p(W = 1) = 0.5$. It is clear that $X$ and $Y$ are
not independent, since $Y$ is a function of $X$.

a. Show $Y \sim \mathcal{N}(0, 1)$.

Figure 4.20 Posterior for $\mu$. (a) Plug-in approximation. (b) Exact posterior. Figure generated by
sensorFusionUnknownPrec.

b. Show $\mathrm{cov}[X, Y] = 0$. Thus $X$ and $Y$ are uncorrelated but dependent, even though they are Gaussian.
Hint: use the definition of covariance

$$
\mathrm{cov}[X, Y] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] \tag{4.263}
$$

and the rule of iterated expectation

$$
\mathbb{E}[XY] = \mathbb{E}\left[\mathbb{E}[XY|W]\right] \tag{4.264}
$$

Exercise 4.3 Correlation coefficient is between -1 and +1 
Prove that $-1 \leq \rho(X, Y) \leq 1$.


Exercise 4.4 Correlation coefficient for linearly related variables is +1 
Show that, if $Y = aX + b$ for some parameters $a > 0$ and $b$, then $\rho(X, Y) = 1$. Similarly show that if
$a < 0$, then $\rho(X, Y) = -1$.

Exercise 4.5 Normalization constant for a multidimensional Gaussian 

Prove that the normalization constant for a d-dimensional Gaussian is given by 


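$$
(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{\frac{1}{2}} = \int \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right) d\mathbf{x} \tag{4.265}
$$

Hint: diagonalize $\boldsymbol{\Sigma}$ and use the fact that $|\boldsymbol{\Sigma}| = \prod_i \lambda_i$ to write the joint pdf as a product of $d$
one-dimensional Gaussians in a transformed coordinate system. (You will need the change of variables formula.)
Finally, use the normalization constant for univariate Gaussians.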


Exercise 4.6 Bivariate Gaussian 
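Let $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ where $\mathbf{x} \in \mathbb{R}^2$ and

$$
\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \tag{4.266}
$$

where $\rho$ is the correlation coefficient. Show that the pdf is given by

\begin{align}
p(x_1, x_2) &= \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \tag{4.267} \\
&\quad \exp\left(-\frac{1}{2(1 - \rho^2)}\left(\frac{(x_1 - \mu_1)^2}{\sigma_1^2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2} - 2\rho\frac{(x_1 - \mu_1)}{\sigma_1}\frac{(x_2 - \mu_2)}{\sigma_2}\right)\right) \tag{4.268}
\end{align}

Figure 4.21 (a) Height/weight data for the men. (b) Standardized. (c) Whitened.


Exercise 4.7 Conditioning a bivariate Gaussian


Consider a bivariate Gaussian distribution $p(x_1, x_2) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})$ where

$$
\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix} = \sigma_1\sigma_2 \begin{pmatrix} \frac{\sigma_1}{\sigma_2} & \rho \\ \rho & \frac{\sigma_2}{\sigma_1} \end{pmatrix} \tag{4.269}
$$

where the correlation coefficient is given by

$$
\rho \triangleq \frac{\sigma_{12}}{\sigma_1\sigma_2} \tag{4.270}
$$

a. What is $P(X_2|x_1)$? Simplify your answer by expressing it in terms of $\rho$, $\sigma_2$, $\sigma_1$, $\mu_1$, $\mu_2$ and $x_1$.
b. Assume $\sigma_1 = \sigma_2 = 1$. What is $P(X_2|x_1)$ now?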

Exercise 4.8 Whitening vs standardizing 


a. Load the height/weight data using rawdata = dlmread('heightWeightData.txt'). The first column
is the class label (1=male, 2=female), the second column is height, the third weight. Extract the
height/weight data corresponding to the males. Fit a 2d Gaussian to the male data, using the empirical
mean and covariance. Plot your Gaussian as an ellipse (use gaussPlot2d), superimposing on your
scatter plot. It should look like Figure 4.21(a), where we have labeled each datapoint by its index. Turn in
your figure and code.


b. Standardizing the data means ensuring the empirical variance along each dimension is 1. This can be
done by computing $\frac{x_{ij} - \bar{x}_j}{\sigma_j}$, where $\sigma_j$ is the empirical std of dimension $j$. Standardize the data and
replot. It should look like Figure 4.21(b). (Use axis('equal').) Turn in your figure and code.


c. Whitening or sphereing the data means ensuring its empirical covariance matrix is proportional to
$\mathbf{I}$, so the data is uncorrelated and of equal variance along each dimension. This can be done by
computing $\boldsymbol{\Lambda}^{-\frac{1}{2}}\mathbf{U}^T\mathbf{x}$ for each data vector $\mathbf{x}$, where $\mathbf{U}$ are the eigenvectors and $\boldsymbol{\Lambda}$ the eigenvalues of
$\boldsymbol{\Sigma}$. Whiten the data and replot. It should look like Figure 4.21(c). Note that whitening rotates the data,
so people move to counter-intuitive locations in the new coordinate system (see e.g., person 2, who
moves from the right hand side to the left).


Exercise 4.9 Sensor fusion with known variances in 1d 
Suppose we have two sensors with known (and different) variances $v_1$ and $v_2$, but unknown (and the same)
mean $\mu$. Suppose we observe $n_1$ observations $y_i^{(1)} \sim \mathcal{N}(\mu, v_1)$ from the first sensor and $n_2$ observations
$y_i^{(2)} \sim \mathcal{N}(\mu, v_2)$ from the second sensor. (For example, suppose $\mu$ is the true temperature outside,
and sensor 1 is a precise (low variance) digital thermosensing device, and sensor 2 is an imprecise (high
variance) mercury thermometer.) Let $\mathcal{D}$ represent all the data from both sensors. What is the posterior
$p(\mu|\mathcal{D})$, assuming a non-informative prior for $\mu$ (which we can simulate using a Gaussian with a precision
of 0)? Give an explicit expression for the posterior mean and variance.

Exercise 4.10 Derivation of information form formulae for marginalizing and conditioning 


Derive the information form results of Section 4.3.1. 


Exercise 4.11 Derivation of the NIW posterior 


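Derive Equation 4.209. Hint: one can show that

\begin{align}
& N(\bar{\mathbf{x}} - \boldsymbol{\mu})(\bar{\mathbf{x}} - \boldsymbol{\mu})^T + \kappa_0(\boldsymbol{\mu} - \mathbf{m}_0)(\boldsymbol{\mu} - \mathbf{m}_0)^T \tag{4.271} \\
&= \kappa_N(\boldsymbol{\mu} - \mathbf{m}_N)(\boldsymbol{\mu} - \mathbf{m}_N)^T + \frac{\kappa_0 N}{\kappa_N}(\bar{\mathbf{x}} - \mathbf{m}_0)(\bar{\mathbf{x}} - \mathbf{m}_0)^T \tag{4.272}
\end{align}

This is a matrix generalization of an operation called completing the square.⁵

Derive the corresponding result for the normal-Wishart model.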


Exercise 4.12 BIC for Gaussians 
(Source: Jaakkola.) 


The Bayesian information criterion (BIC) is a penalized log-likelihood function that can be used for model 
selection (see Section 5.3.2.4). It is defined as 


$$\mathrm{BIC} = \log p(\mathcal{D}|\hat{\theta}_{ML}) - \frac{d}{2}\log(N) \tag{4.273}$$


where d is the number of free parameters in the model and N is the number of samples. In this question, 
we will see how to use this to choose between a full covariance Gaussian and a Gaussian with a diagonal 
covariance. Obviously a full covariance Gaussian has higher likelihood, but it may not be “worth” the extra 
parameters if the improvement over a diagonal covariance matrix is too small. So we use the BIC score to 
choose the model. 


Following Section 4.1.3, we can write 


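$$\log p(\mathcal{D}|\hat{\Sigma}, \hat{\mu}) = -\frac{N}{2}\,\mathrm{tr}\left(\hat{\Sigma}^{-1}\hat{S}\right) - \frac{N}{2}\log(|\hat{\Sigma}|) \tag{4.274}$$

$$\hat{S} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(x_i - \bar{x})^T \tag{4.275}$$


where $\hat{S}$ is the scatter matrix (empirical covariance), the trace of a matrix is the sum of its diagonal
entries, and we have used the trace trick.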


a. Derive the BIC score for a Gaussian in D dimensions with full covariance matrix. Simplify your answer
as much as possible, exploiting the form of the MLE. Be sure to specify the number of free parameters
d.

b. Derive the BIC score for a Gaussian in D dimensions with a diagonal covariance matrix. Be sure to
specify the number of free parameters d. Hint: for the diagonal case, the ML estimate of $\Sigma$ is the same
as $\hat{\Sigma}_{ML}$ except the off-diagonal terms are zero:

$$\hat{\Sigma}_{diag} = \mathrm{diag}(\hat{\Sigma}_{ML}(1,1), \ldots, \hat{\Sigma}_{ML}(D,D)) \tag{4.276}$$

5. In the scalar case, completing the square means rewriting $c_2 x^2 + c_1 x + c_0$ as $-a(x - b)^2 + w$ where $a = -c_2$,
$b = \frac{c_1}{2a}$, and $w = \frac{c_1^2}{4a} + c_0$.

Exercise 4.13 Gaussian posterior credible interval
(Source: DeGroot.)


Let $X \sim \mathcal{N}(\mu, \sigma^2 = 4)$ where $\mu$ is unknown but has prior $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2 = 9)$. The posterior after
seeing n samples is $\mu \sim \mathcal{N}(\mu_n, \sigma_n^2)$. How big does n have to be to ensure

$$p(\ell \le \mu_n \le u|\mathcal{D}) \ge 0.95 \tag{4.277}$$

where $(\ell, u)$ is an interval (centered on $\mu_n$) of width 1 and $\mathcal{D}$ is the data? (Such an interval is called a
credible interval, and is the Bayesian analog of a confidence interval.) Hint: recall that 95% of the
probability mass of a Gaussian is within $\pm 1.96\sigma$ of the mean.


Exercise 4.14 MAP estimation for 1D Gaussians 
(Source: Jaakkola.) 
Consider samples $x_1, \ldots, x_n$ from a Gaussian random variable with known variance $\sigma^2$ and unknown
mean $\mu$. We further assume a prior distribution (also Gaussian) over the mean, $\mu \sim \mathcal{N}(m, s^2)$, with fixed
mean m and fixed variance $s^2$. Thus the only unknown is $\mu$.

a. Calculate the MAP estimate $\hat{\mu}_{MAP}$. You can state the result without proof. Alternatively, with a lot
more work, you can compute derivatives of the log posterior, set to zero and solve.

b. Show that as the number of samples n increases, the MAP estimate converges to the maximum likelihood
estimate.

c. Suppose n is small and fixed. What does the MAP estimator converge to if we increase the prior
variance $s^2$?

d. Suppose n is small and fixed. What does the MAP estimator converge to if we decrease the prior
variance $s^2$?


Exercise 4.15 Sequential (recursive) updating of $\hat{\Sigma}$
(Source: (Duda et al. 2001, Q3.35,3.36).)
The unbiased estimate for the covariance of a d-dimensional Gaussian based on n samples is given by

$$\hat{\Sigma} = C_n = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - m_n)(x_i - m_n)^T \tag{4.278}$$

It is clear that it takes $O(nd^2)$ time to compute $C_n$. If the data points arrive one at a time, it is more
efficient to incrementally update these estimates than to recompute from scratch.


a. Show that the covariance can be sequentially updated as follows

$$C_{n+1} = \frac{n-1}{n}C_n + \frac{1}{n+1}(x_{n+1} - m_n)(x_{n+1} - m_n)^T \tag{4.279}$$


b. How much time does it take per sequential update? (Use big-O notation.) 
c. Show that we can sequentially update the precision matrix using

$$C_{n+1}^{-1} = \frac{n}{n-1}\left[C_n^{-1} - \frac{C_n^{-1}(x_{n+1} - m_n)(x_{n+1} - m_n)^T C_n^{-1}}{\frac{n^2-1}{n} + (x_{n+1} - m_n)^T C_n^{-1}(x_{n+1} - m_n)}\right] \tag{4.280}$$

Hint: notice that the update to $C_{n+1}$ consists of adding a rank-one matrix, namely $uu^T$, where
$u = x_{n+1} - m_n$. Use the matrix inversion lemma for rank-one updates (Equation 4.111), which we
repeat here for convenience:

$$(E + uv^T)^{-1} = E^{-1} - \frac{E^{-1}uv^TE^{-1}}{1 + v^TE^{-1}u} \tag{4.281}$$

d. What is the time complexity per update?
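
Before attempting the derivation, it can help to check Equation 4.279 numerically. The sketch below is only a sanity check on random 2d data, not a proof; the helper names are hypothetical, and the accompanying mean update $m_{n+1} = m_n + (x_{n+1} - m_n)/(n+1)$ is the standard running-mean formula, which the exercise does not state explicitly.

```python
import random

def batch_mean_cov(xs):
    """Mean and unbiased covariance (dividing by n - 1), computed from scratch."""
    n, d = len(xs), len(xs[0])
    m = [sum(x[j] for x in xs) / n for j in range(d)]
    C = [[sum((x[i] - m[i]) * (x[j] - m[j]) for x in xs) / (n - 1)
          for j in range(d)] for i in range(d)]
    return m, C

def sequential_update(n, m, C, x_new):
    """Fold the (n+1)-th point into the mean m and covariance C of n points (Equation 4.279)."""
    d = len(m)
    u = [x_new[j] - m[j] for j in range(d)]          # u = x_{n+1} - m_n
    C_new = [[(n - 1) / n * C[i][j] + u[i] * u[j] / (n + 1) for j in range(d)]
             for i in range(d)]
    m_new = [m[j] + u[j] / (n + 1) for j in range(d)]
    return m_new, C_new

random.seed(1)
data = [[random.gauss(0, 1), random.gauss(2, 3)] for _ in range(50)]

m, C = batch_mean_cov(data[:2])        # start from n = 2 so that n - 1 > 0
for n in range(2, len(data)):
    m, C = sequential_update(n, m, C, data[n])

m_batch, C_batch = batch_mean_cov(data)
max_err = max(abs(C[i][j] - C_batch[i][j]) for i in range(2) for j in range(2))
print("largest discrepancy between sequential and batch covariance:", max_err)
```

The discrepancy should be at the level of floating-point rounding error.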


Exercise 4.16 Likelihood ratio for Gaussians 


(Source: Alpaydin p103 ex 4.) Consider a binary classifier where the K class conditional densities
are MVN $p(x|y = j) = \mathcal{N}(x|\mu_j, \Sigma_j)$. By Bayes rule, we have

$$\log\frac{p(y = 1|x)}{p(y = 0|x)} = \log\frac{p(x|y = 1)}{p(x|y = 0)} + \log\frac{p(y = 1)}{p(y = 0)} \tag{4.282}$$

In other words, the log posterior ratio is the log likelihood ratio plus the log prior ratio. For each of the 4
cases in the table below, derive an expression for the log likelihood ratio $\log\frac{p(x|y = 1)}{p(x|y = 0)}$, simplifying as much
as possible.


Form of $\Sigma_j$          Cov                                                        Num parameters
Arbitrary               $\Sigma_j$                                                 Kd(d + 1)/2
Shared                  $\Sigma_j = \Sigma$                                        d(d + 1)/2
Shared, axis-aligned    $\Sigma_j = \Sigma$ with $\Sigma_{ij} = 0$ for $i \ne j$       d
Shared, spherical       $\Sigma_j = \sigma^2 I$                                    1

Exercise 4.17 LDA/QDA on height/weight data 

The function discrimAnalysisHeightWeightDemo fits an LDA and QDA model to the height/weight 
data. Compute the misclassification rate of both of these models on the training set. Turn in your numbers 
and code. 

Exercise 4.18 Naive Bayes with mixed features 


Consider a 3 class naive Bayes classifier with one binary feature and one Gaussian feature: 
$$y \sim \mathrm{Mu}(y|\pi, 1), \quad x_1|y = c \sim \mathrm{Ber}(x_1|\theta_c), \quad x_2|y = c \sim \mathcal{N}(x_2|\mu_c, \sigma_c^2) \tag{4.283}$$

Let the parameter vectors be as follows:

$$\pi = (0.5, 0.25, 0.25), \quad \theta = (0.5, 0.5, 0.5), \quad \mu = (-1, 0, 1), \quad \sigma^2 = (1, 1, 1) \tag{4.284}$$

a. Compute $p(y|x_1 = 0, x_2 = 0)$ (the result should be a vector of 3 numbers that sums to 1).
b. Compute $p(y|x_1 = 0)$.
c. Compute $p(y|x_2 = 0)$.
d. Explain any interesting patterns you see in your results. Hint: look at the parameter vector $\theta$.


Exercise 4.19 Decision boundary for LDA with semi tied covariances 


Consider a generative classifier with class conditional densities of the form $\mathcal{N}(x|\mu_c, \Sigma_c)$. In LDA, we
assume $\Sigma_c = \Sigma$, and in QDA, each $\Sigma_c$ is arbitrary. Here we consider the 2 class case in which
$\Sigma_1 = k\Sigma_0$, for k > 1. That is, the Gaussian ellipsoids have the same “shape”, but the one for class 1
is “wider”. Derive an expression for $p(y = 1|x, \theta)$, simplifying as much as possible. Give a geometric
interpretation of your result, if possible.


Exercise 4.20 Logistic regression vs LDA/QDA 

(Source: Jaakkola.) Suppose we train the following binary classifiers via maximum likelihood. 

a. GaussI: A generative classifier, where the class conditional densities are Gaussian, with both covariance
matrices set to I (identity matrix), i.e., $p(x|y = c) = \mathcal{N}(x|\mu_c, I)$. We assume p(y) is uniform.
b. GaussX: as for GaussI, but the covariance matrices are unconstrained, i.e., $p(x|y = c) = \mathcal{N}(x|\mu_c, \Sigma_c)$.
c. LinLog: A logistic regression model with linear features.
d. QuadLog: A logistic regression model, using linear and quadratic features (i.e., polynomial basis function
expansion of degree 2).


After training we compute the performance of each model M on the training set as follows: 


$$L(M) = \frac{1}{n}\sum_{i=1}^{n}\log p(y_i|x_i, \hat{\theta}, M) \tag{4.285}$$

(Note that this is the conditional log-likelihood $p(y|x, \theta)$ and not the joint log-likelihood $p(y, x|\theta)$.) We
now want to compare the performance of each model. We will write $L(M) \le L(M')$ if model M must
have lower (or equal) log likelihood (on the training set) than $M'$, for any training set (in other words, M is
worse than $M'$, at least as far as training set logprob is concerned). For each of the following model pairs,
state whether $L(M) \le L(M')$, $L(M) \ge L(M')$, or whether no such statement can be made (i.e., M
might sometimes be better than $M'$ and sometimes worse); also, for each question, briefly (1-2 sentences)
explain why.


a. GaussI, LinLog.
b. GaussX, QuadLog.
c. LinLog, QuadLog.
d. GaussI, QuadLog.
e. Now suppose we measure performance in terms of the average misclassification rate on the training
set:

$$R(M) = \frac{1}{n}\sum_{i=1}^{n}\mathbb{I}(y_i \ne \hat{y}(x_i)) \tag{4.286}$$

Is it true in general that $L(M) > L(M')$ implies that $R(M) < R(M')$? Explain why or why not.

Exercise 4.21 Gaussian decision boundaries


(Source: (Duda et al. 2001, Q3.7).) Let $p(x|y = j) = \mathcal{N}(x|\mu_j, \sigma_j)$ where j = 1, 2 and $\mu_1 = 0$, $\sigma_1^2 = 1$,
$\mu_2 = 1$, $\sigma_2^2 = 10^6$. Let the class priors be equal, p(y = 1) = p(y = 2) = 0.5.

a. Find the decision region

$$R_1 = \{x : p(x|\mu_1, \sigma_1) \ge p(x|\mu_2, \sigma_2)\} \tag{4.287}$$

Sketch the result. Hint: draw the curves and find where they intersect. Find both solutions of the
equation

$$p(x|\mu_1, \sigma_1) = p(x|\mu_2, \sigma_2) \tag{4.288}$$

Hint: recall that to solve a quadratic equation $ax^2 + bx + c = 0$, we use

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \tag{4.289}$$

b. Now suppose $\sigma_2 = 1$ (and all other parameters remain the same). What is $R_1$ in this case?

Exercise 4.22 QDA with 3 classes


Consider a three category classification problem. Let the prior probabilities be:

$$P(Y = 1) = P(Y = 2) = P(Y = 3) = 1/3 \tag{4.290}$$

The class-conditional densities are multivariate normal densities with parameters:

$$\mu_1 = [0, 0]^T, \quad \mu_2 = [1, 1]^T, \quad \mu_3 = [-1, 1]^T \tag{4.291}$$

(4.292)

Classify the following points:

a. $x = [-0.5, 0.5]$
b. $x = [0.5, 0.5]$


Exercise 4.23 Scalar QDA



[Note: you can solve this exercise by hand or using a computer (matlab, R, whatever). In either case, show
your work.] Consider the following training set of heights x (in inches) and gender y (male/female) of some
US college students: x = (67, 79, 71, 68, 67, 60), y = (m, m, m, f, f, f).

a. Fit a Bayes classifier to this data, using maximum likelihood estimation, i.e., estimate the parameters of
the class conditional likelihoods

$$p(x|y = c) = \mathcal{N}(x; \mu_c, \sigma_c) \tag{4.293}$$

and the class prior

$$p(y = c) = \pi_c \tag{4.294}$$

What are your values of $\mu_c$, $\sigma_c$, $\pi_c$ for c = m, f? Show your work (so you can get partial credit if you
make an arithmetic error).

b. Compute $p(y = m|x, \hat{\theta})$, where x = 72, and $\hat{\theta}$ are the MLE parameters. (This is called a plug-in
prediction.)

c. What would be a simple way to extend this technique if you had multiple attributes per person, such
as height and weight? Write down your proposed model as an equation.


Bayesian statistics


5.1 Introduction


We have now seen a variety of different probability models, and we have discussed how to
fit them to data, i.e., we have discussed how to compute MAP parameter estimates
$\hat{\theta} = \arg\max_\theta p(\theta|\mathcal{D})$, using a variety of different priors. We have also discussed how to compute
the full posterior $p(\theta|\mathcal{D})$, as well as the posterior predictive density, $p(x|\mathcal{D})$, for certain special
cases (and in later chapters, we will discuss algorithms for the general case).

Using the posterior distribution to summarize everything we know about a set of unknown
variables is at the core of Bayesian statistics. In this chapter, we discuss this approach to
statistics in more detail. In Chapter 6, we discuss an alternative approach to statistics known as
frequentist or classical statistics.


5.2 Summarizing posterior distributions


The posterior $p(\theta|\mathcal{D})$ summarizes everything we know about the unknown quantities $\theta$. In this
section, we discuss some simple quantities that can be derived from a probability distribution,
such as a posterior. These summary statistics are often easier to understand and visualize than
the full joint.


5.2.1 MAP estimation


We can easily compute a point estimate of an unknown quantity by computing the posterior 
mean, median or mode. In Section 5.7, we discuss how to use decision theory to choose between 
these methods. Typically the posterior mean or median is the most appropriate choice for a real- 
valued quantity, and the vector of posterior marginals is the best choice for a discrete quantity. 
However, the posterior mode, aka the MAP estimate, is the most popular choice because it 
reduces to an optimization problem, for which efficient algorithms often exist. Furthermore, MAP
estimation can be interpreted in non-Bayesian terms, by thinking of the log prior as a regularizer 
(see Section 6.5 for more details). 

Although this approach is computationally appealing, it is important to point out that there 
are various drawbacks to MAP estimation, which we briefly discuss below. This will provide 
motivation for the more thoroughly Bayesian approach which we will study later in this chapter 
(and elsewhere in this book). 


Figure 5.1 (a) A bimodal distribution in which the mode is very untypical of the distribution. The thin
blue vertical line is the mean, which is arguably a better summary of the distribution, since it is near the
majority of the probability mass. Figure generated by bimodalDemo. (b) A skewed distribution in which
the mode is quite different from the mean. Figure generated by gammaPlotDemo.


5.2.1.1 No measure of uncertainty


The most obvious drawback of MAP estimation, and indeed of any other point estimate such
as the posterior mean or median, is that it does not provide any measure of uncertainty. In
many applications, it is important to know how much one can trust a given estimate. We can
derive such confidence measures from the posterior, as we discuss in Section 5.2.2.


5.2.1.2 Plugging in the MAP estimate can result in overfitting


In machine learning, we often care more about predictive accuracy than about interpreting the
parameters of our models. However, if we don’t model the uncertainty in our parameters, then
our predictive distribution will be overconfident. We saw several examples of this in Chapter 3,
and we will see more examples later. Overconfidence in predictions is particularly problematic
in situations where we may be risk averse; see Section 5.7 for details.


5.2.1.3 The mode is an untypical point


Choosing the mode as a summary of a posterior distribution is often a very poor choice, since 
the mode is usually quite untypical of the distribution, unlike the mean or median. This is 
illustrated in Figure 5.1(a) for a 1d continuous space. The basic problem is that the mode is a 
point of measure zero, whereas the mean and median take the volume of the space into account. 
Another example is shown in Figure 5.1(b): here the mode is 0, but the mean is non-zero. Such 
skewed distributions often arise when inferring variance parameters, especially in hierarchical 
models. In such cases the MAP estimate (and hence the MLE) is obviously a very bad estimate. 

How should we summarize a posterior if the mode is not a good choice? The answer is to 
use decision theory, which we discuss in Section 5.7. The basic idea is to specify a loss function, 
where $L(\theta, \hat{\theta})$ is the loss you incur if the truth is $\theta$ and your estimate is $\hat{\theta}$. If we use 0-1 loss,
$L(\theta, \hat{\theta}) = \mathbb{I}(\theta \ne \hat{\theta})$, then the optimal estimate is the posterior mode. 0-1 loss means you only
get “points” if you make no errors, otherwise you get nothing: there is no “partial credit” under
this loss function! For continuous-valued quantities, we often prefer to use squared error loss,
$L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$; the corresponding optimal estimator is then the posterior mean, as we show
in Section 5.7. Or we can use a more robust loss function, $L(\theta, \hat{\theta}) = |\theta - \hat{\theta}|$, which gives rise to
the posterior median.


Figure 5.2 Example of the transformation of a density under a nonlinear transform. Note how the mode
of the transformed distribution is not the transform of the original mode. Based on Exercise 1.4 of (Bishop
2006b). Figure generated by bayesChangeOfVar.
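
The link between loss functions and point estimates can be checked empirically. The following is a minimal sketch, assuming that draws from a skewed Gamma(2, 1) distribution stand in for posterior samples (this distribution and all variable names are illustrative choices, not taken from the text). It searches a grid of candidate estimates for the one with the smallest average loss.

```python
import random
import statistics

random.seed(0)
# Stand-in "posterior samples" from a skewed Gamma(shape=2, scale=1) distribution.
samples = [random.gammavariate(2.0, 1.0) for _ in range(1000)]

def expected_loss(estimate, loss):
    """Monte Carlo estimate of the posterior expected loss of a candidate estimate."""
    return sum(loss(theta, estimate) for theta in samples) / len(samples)

grid = [i * 0.01 for i in range(601)]            # candidate estimates in [0, 6]
best_sq = min(grid, key=lambda e: expected_loss(e, lambda t, a: (t - a) ** 2))
best_abs = min(grid, key=lambda e: expected_loss(e, lambda t, a: abs(t - a)))

print("squared-error minimiser: ", best_sq, "  sample mean:  ", statistics.mean(samples))
print("absolute-error minimiser:", best_abs, "  sample median:", statistics.median(samples))
```

Up to the grid resolution, the squared-error minimiser coincides with the sample mean and the absolute-error minimiser with the sample median.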


MAP estimation is not invariant to reparameterization * 


A more subtle problem with MAP estimation is that the result we get depends on how we pa- 
rameterize the probability distribution. Changing from one representation to another equivalent 
representation changes the result, which is not very desirable, since the units of measurement 
are arbitrary (e.g., when measuring distance, we can use centimetres or inches). 

To understand the problem, suppose we compute the posterior for x. If we define y = f(x), 
the distribution for y is given by Equation 2.87, which we repeat here for convenience: 


$$p_y(y) = p_x(x)\left|\frac{dx}{dy}\right| \tag{5.1}$$

The $\left|\frac{dx}{dy}\right|$ term is called the Jacobian, and it measures the change in size of a unit volume passed
through f. Let $\hat{x} = \arg\max_x p_x(x)$ be the MAP estimate for x. In general it is not the case
that $\hat{y} = \arg\max_y p_y(y)$ is given by $f(\hat{x})$. For example, let $x \sim \mathcal{N}(6, 1)$ and y = f(x), where

$$f(x) = \frac{1}{1 + \exp(-x + 5)} \tag{5.2}$$

We can derive the distribution of y using Monte Carlo simulation (see Section 2.7.1). The result 
is shown in Figure 5.2. We see that the original Gaussian has become “squashed” by the sigmoid 
nonlinearity. In particular, we see that the mode of the transformed distribution is not equal to 
the transform of the original mode. 
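
The same conclusion can be reached without simulation by evaluating Equation 5.1 directly. This is a sketch rather than the method used for Figure 5.2 (which relies on Monte Carlo); it assumes the inverse map $x = 5 + \log(y/(1-y))$, obtained by solving Equation 5.2 for x, and locates the mode of $p_y$ on a grid. The function names are hypothetical.

```python
import math

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def f(x):
    return 1.0 / (1.0 + math.exp(-x + 5.0))      # Equation 5.2

def p_y(y):
    """Density of y = f(x) for x ~ N(6, 1), using Equation 5.1."""
    x = 5.0 + math.log(y / (1.0 - y))             # inverse of f
    dx_dy = 1.0 / (y * (1.0 - y))                 # Jacobian |dx/dy|
    return normal_pdf(x, 6.0, 1.0) * dx_dy

x_mode = 6.0                                      # mode of N(6, 1)
ys = [i / 10000 for i in range(1, 10000)]         # grid on the open interval (0, 1)
y_mode = max(ys, key=p_y)

print("f(mode of x) =", round(f(x_mode), 3))
print("mode of y    =", round(y_mode, 3))
```

The transformed mode $f(\hat{x})$ is about 0.73, whereas the mode of $p_y$ sits near 0.84, so the two do not agree.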
To see how this problem arises in the context of MAP estimation, consider the following
example, due to Michael Jordan. The Bernoulli distribution is typically parameterized by its
mean $\mu$, so $p(y = 1|\mu) = \mu$, where $y \in \{0, 1\}$. Suppose we have a uniform prior on the
unit interval: $p_\mu(\mu) = 1\,\mathbb{I}(0 \le \mu \le 1)$. If there is no data, the MAP estimate is just the
mode of the prior, which can be anywhere between 0 and 1. We will now show that different
parameterizations can pick different points in this interval arbitrarily.

First let $\theta = \sqrt{\mu}$ so $\mu = \theta^2$. The new prior is

$$p_\theta(\theta) = p_\mu(\mu)\left|\frac{d\mu}{d\theta}\right| = 2\theta \tag{5.3}$$

for $\theta \in [0, 1]$ so the new mode is

$$\hat{\theta}_{MAP} = \arg\max_{\theta \in [0,1]} 2\theta = 1 \tag{5.4}$$

Now let $\phi = 1 - \sqrt{1 - \mu}$. The new prior is

$$p_\phi(\phi) = p_\mu(\mu)\left|\frac{d\mu}{d\phi}\right| = 2(1 - \phi) \tag{5.5}$$

for $\phi \in [0, 1]$, so the new mode is

$$\hat{\phi}_{MAP} = \arg\max_{\phi \in [0,1]} 2 - 2\phi = 0 \tag{5.6}$$


Thus the MAP estimate depends on the parameterization. The MLE does not suffer from this 
since the likelihood is a function, not a probability density. Bayesian inference does not suffer 
from this problem either, since the change of measure is taken into account when integrating 
over the parameter space. 
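
The two reparameterizations can be compared numerically. This is a small sketch under the assumption that a grid search over [0, 1] is an adequate stand-in for the exact argmax; it maps each mode back to the original parameter $\mu$ to show that the two answers disagree.

```python
grid = [i / 1000 for i in range(1001)]           # 0.000, 0.001, ..., 1.000

# Parameterization 1: theta = sqrt(mu), with prior density 2 * theta (Equation 5.3).
theta_map = max(grid, key=lambda t: 2 * t)
mu_from_theta = theta_map ** 2                   # map the mode back: mu = theta^2

# Parameterization 2: phi = 1 - sqrt(1 - mu), with prior density 2 * (1 - phi) (Equation 5.5).
phi_map = max(grid, key=lambda p: 2 * (1 - p))
mu_from_phi = 1 - (1 - phi_map) ** 2             # map the mode back: mu = 1 - (1 - phi)^2

print("mu implied by the theta parameterization:", mu_from_theta)
print("mu implied by the phi parameterization:  ", mu_from_phi)
```

One parameterization puts the MAP estimate of $\mu$ at 1 and the other at 0, even though both describe the same uniform prior on $\mu$.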

One solution to the problem is to optimize the following objective function: 


$$\hat{\theta} = \arg\max_\theta p(\mathcal{D}|\theta)p(\theta)|I(\theta)|^{-\frac{1}{2}} \tag{5.7}$$


Here $I(\theta)$ is the Fisher information matrix associated with $p(x|\theta)$ (see Section 6.2.2). This
estimate is parameterization independent, for reasons explained in (Jermyn 2005; Druilhet and
Marin 2007). Unfortunately, optimizing Equation 5.7 is often difficult, which reduces the
appeal of the whole approach.


Credible intervals 


In addition to point estimates, we often want a measure of confidence. A standard measure of 
confidence in some (scalar) quantity $\theta$ is the “width” of its posterior distribution. This can be
measured using a $100(1 - \alpha)\%$ credible interval, which is a (contiguous) region $C = (\ell, u)$
(standing for lower and upper) which contains $1 - \alpha$ of the posterior probability mass, i.e.,

$$C_\alpha(\mathcal{D}) = (\ell, u) : P(\ell \le \theta \le u|\mathcal{D}) = 1 - \alpha \tag{5.8}$$

There may be many such intervals, so we choose one such that there is $\alpha/2$ mass in each
tail; this is called a central interval.
Figure 5.3 (a) Central interval and (b) HPD region for a Beta(3,9) posterior. The CI is (0.06, 0.52) and the
HPD is (0.04, 0.48). Based on Figure 3.6 of (Hoff 2009). Figure generated by betaHPD. 


If the posterior has a known functional form, we can compute the posterior central interval 
using $\ell = F^{-1}(\alpha/2)$ and $u = F^{-1}(1 - \alpha/2)$, where F is the cdf of the posterior. For example, if
the posterior is Gaussian, $p(\theta|\mathcal{D}) = \mathcal{N}(0, 1)$, and $\alpha = 0.05$, then we have $\ell = \Phi^{-1}(\alpha/2) = -1.96$,
and $u = \Phi^{-1}(1 - \alpha/2) = 1.96$, where $\Phi$ denotes the cdf of the Gaussian. This is illustrated in
Figure 2.3(c). This justifies the common practice of quoting a credible interval in the form of
$\mu \pm 2\sigma$, where $\mu$ represents the posterior mean, $\sigma$ represents the posterior standard deviation,
and 2 is a good approximation to 1.96.
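
For the Gaussian case the endpoints can be reproduced with the inverse cdf. A minimal sketch, assuming Python's standard-library `statistics.NormalDist` as the implementation of $\Phi$ and $\Phi^{-1}$ (the variable names are illustrative):

```python
from statistics import NormalDist

alpha = 0.05
posterior = NormalDist(mu=0.0, sigma=1.0)        # p(theta | D) = N(0, 1)

lower = posterior.inv_cdf(alpha / 2)             # l = Phi^{-1}(alpha / 2)
upper = posterior.inv_cdf(1 - alpha / 2)         # u = Phi^{-1}(1 - alpha / 2)
mass = posterior.cdf(upper) - posterior.cdf(lower)

print(round(lower, 2), round(upper, 2), round(mass, 3))
```

The interval is (-1.96, 1.96) and contains 95% of the mass, leaving 2.5% in each tail.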

Of course, the posterior is not always Gaussian. For example, in our coin example, if we 
use a uniform prior and we observe $N_1 = 47$ heads out of N = 100 trials, then the posterior
is a beta distribution, $p(\theta|\mathcal{D}) = \mathrm{Beta}(48, 54)$. We find the 95% posterior credible interval is
(0.3749, 0.5673) (see betaCredibleInt for the one line of Matlab code we used to compute 
this). 

If we don't know the functional form, but we can draw samples from the posterior, then we
can use a Monte Carlo approximation to the posterior quantiles: to approximate the $\alpha$ quantile, we
simply sort the S samples, and find the one that occurs at location $\alpha S$ along the sorted list. As
$S \to \infty$, this converges to the true quantile. See mcQuantileDemo for a demo.
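
As an illustration of the sorting recipe, the sketch below approximates the 95% central interval of the Beta(48, 54) posterior from the coin example. It is not the book's betaCredibleInt or mcQuantileDemo code; the sample size, seed, and index arithmetic are assumptions made here.

```python
import random

random.seed(0)
alpha = 0.05
S = 100_000
samples = sorted(random.betavariate(48, 54) for _ in range(S))

# Read off the alpha/2 and 1 - alpha/2 quantiles from the sorted list.
lower = samples[int((alpha / 2) * S)]
upper = samples[int((1 - alpha / 2) * S) - 1]

print(round(lower, 3), round(upper, 3))
```

With this many samples the endpoints should land close to the interval (0.3749, 0.5673) quoted above.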

People often confuse Bayesian credible intervals with frequentist confidence intervals. How- 
ever, they are not the same thing, as we discuss in Section 6.6.1. In general, credible intervals are 
usually what people want to compute, but confidence intervals are usually what they actually 
compute, because most people are taught frequentist statistics but not Bayesian statistics. Fortu- 
nately, the mechanics of computing a credible interval is just as easy as computing a confidence 
interval (see e.g., betaCredibleInt for how to do it in Matlab). 


Highest posterior density regions * 


A problem with central intervals is that there might be points outside the CI which have higher 
probability density. This is illustrated in Figure 5.3(a), where we see that points outside the 
left-most CI boundary have higher density than those just inside the right-most CI boundary. 
This motivates an alternative quantity known as the highest posterior density or HPD region. 
This is defined as the (set of) most probable points that in total constitute $100(1 - \alpha)\%$ of the
probability mass. More formally, we find the threshold $p^*$ on the pdf such that

$$1 - \alpha = \int_{\theta: p(\theta|\mathcal{D}) > p^*} p(\theta|\mathcal{D})\,d\theta \tag{5.9}$$

and then define the HPD as

$$C_\alpha(\mathcal{D}) = \{\theta : p(\theta|\mathcal{D}) \ge p^*\} \tag{5.10}$$


Figure 5.4 (a) Central interval and (b) HPD region for a hypothetical multimodal posterior. Based on
Figure 2.2 of (Gelman et al. 2004). Figure generated by postDensityIntervals.


In 1d, the HPD region is sometimes called a highest density interval or HDI. For example,
Figure 5.3(b) shows the 95% HDI of a Beta(3,9) distribution, which is (0.04, 0.48). We see that 
this is narrower than the CI, even though it still contains 95% of the mass; furthermore, every 
point inside of it has higher density than every point outside of it. 

For a unimodal distribution, the HDI will be the narrowest interval around the mode contain- 
ing 95% of the mass. To see this, imagine “water filling” in reverse, where we lower the level 
until 95% of the mass is revealed, and only 5% is submerged. This gives a simple algorithm for 
computing HDIs in the 1d case: simply search over points such that the interval contains 95%
of the mass and has minimal width. This can be done by 1d numerical optimization if we know 
the inverse CDF of the distribution, or by search over the sorted data points if we have a bag of 
samples (see betaHPD for a demo). 

If the posterior is multimodal, the HDI may not even be a connected region: see Figure 5.4(b) 
for an example. However, summarizing multimodal posteriors is always difficult. 
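
The search over sorted samples described above can be sketched as follows. This is an illustrative implementation, not the book's betaHPD code: it assumes a unimodal posterior, uses Monte Carlo draws from Beta(3, 9), and the function name `hdi_from_samples` is hypothetical.

```python
import random

def hdi_from_samples(samples, mass=0.95):
    """Narrowest interval spanning `mass` of the sorted samples (unimodal case)."""
    xs = sorted(samples)
    n = len(xs)
    k = int(mass * n)                             # number of points the interval must span
    widths = [xs[i + k] - xs[i] for i in range(n - k)]
    i_best = min(range(n - k), key=widths.__getitem__)
    return xs[i_best], xs[i_best + k]

random.seed(0)
draws = [random.betavariate(3, 9) for _ in range(50_000)]
lo, hi = hdi_from_samples(draws)

# Central interval from the same samples, for comparison.
srt = sorted(draws)
ci = (srt[int(0.025 * len(srt))], srt[int(0.975 * len(srt))])

print("HDI:", round(lo, 2), round(hi, 2))
print("CI: ", round(ci[0], 2), round(ci[1], 2))
```

The HDI should come out close to (0.04, 0.48) and be no wider than the central interval, in line with Figure 5.3.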


Inference for a difference in proportions 


Sometimes we have multiple parameters, and we are interested in computing the posterior 
distribution of some function of these parameters. For example, suppose you are about to buy 
something from Amazon.com, and there are two sellers offering it for the same price. Seller 1 
has 90 positive reviews and 10 negative reviews. Seller 2 has 2 positive reviews and 0 negative 
reviews. Who should you buy from?$^1$


1. This example is from www.johndcook.com/blog/2011/09/27/bayesian-amazon. See also
lingpipe-blog.com/2009/10/13/bayesian-counterpart-to-fisher-exact-test-on-contingency-tables.
Figure 5.5 (a) Exact posteriors $p(\theta_i|\mathcal{D}_i)$. (b) Monte Carlo approximation to $p(\delta|\mathcal{D})$. We use kernel density
estimation to get a smooth plot. The vertical lines enclose the 95% central interval. Figure generated by
amazonSellerDemo.


On the face of it, you should pick seller 2, but we cannot be very confident that seller 2 is 
better since it has had so few reviews. In this section, we sketch a Bayesian analysis of this 
problem. Similar methodology can be used to compare rates or proportions across groups for a 
variety of other settings. 

Let $\theta_1$ and $\theta_2$ be the unknown reliabilities of the two sellers. Since we don’t know much
about them, we’ll endow them both with uniform priors, $\theta_i \sim \mathrm{Beta}(1, 1)$. The posteriors are
$p(\theta_1|\mathcal{D}_1) = \mathrm{Beta}(91, 11)$ and $p(\theta_2|\mathcal{D}_2) = \mathrm{Beta}(3, 1)$.

We want to compute $p(\theta_1 > \theta_2|\mathcal{D})$. For convenience, let us define $\delta = \theta_1 - \theta_2$ as the
difference in the rates. (Alternatively we might want to work in terms of the log-odds ratio.) We
can compute the desired quantity using numerical integration:

$$p(\delta > 0|\mathcal{D}) = \int_0^1\int_0^1 \mathbb{I}(\theta_1 > \theta_2)\,\mathrm{Beta}(\theta_1|y_1 + 1, N_1 - y_1 + 1)\,\mathrm{Beta}(\theta_2|y_2 + 1, N_2 - y_2 + 1)\,d\theta_1 d\theta_2 \tag{5.11}$$

We find $p(\delta > 0|\mathcal{D}) = 0.710$, which means you are better off buying from seller 1! See
amazonSellerDemo for the code. (It is also possible to solve the integral analytically (Cook 
2005).) 

A simpler way to solve the problem is to approximate the posterior $p(\delta|\mathcal{D})$ by Monte Carlo
sampling. This is easy, since $\theta_1$ and $\theta_2$ are independent in the posterior, and both have beta
distributions, which can be sampled from using standard methods. The distributions $p(\theta_i|\mathcal{D}_i)$
are shown in Figure 5.5(a), and a MC approximation to $p(\delta|\mathcal{D})$, together with a 95% HPD, is
shown in Figure 5.5(b). An MC approximation to $p(\delta > 0|\mathcal{D})$ is obtained by counting the fraction
of samples where $\theta_1 > \theta_2$; this turns out to be 0.718, which is very close to the exact value. (See
amazonSellerDemo for the code.) 
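
The Monte Carlo estimate takes only a few lines to reproduce. The sketch below is not the amazonSellerDemo code; the sample size and seed are arbitrary choices. As a cross-check it also evaluates a closed form that holds for these particular posteriors: since the cdf of Beta(3, 1) is $t^3$, the probability equals $\mathbb{E}[\theta_1^3]$ under Beta(91, 11).

```python
import random

random.seed(0)
S = 100_000
wins = 0
for _ in range(S):
    theta1 = random.betavariate(91, 11)           # seller 1: 90 positive, 10 negative reviews
    theta2 = random.betavariate(3, 1)             # seller 2: 2 positive, 0 negative reviews
    if theta1 > theta2:
        wins += 1
p_mc = wins / S

# Closed form for this special case: P(theta2 < t) = t^3, so the answer is E[theta1^3].
p_closed = (91 * 92 * 93) / (102 * 103 * 104)

print("Monte Carlo estimate:", round(p_mc, 3))
print("closed form:         ", round(p_closed, 3))
```

Both numbers come out at about 0.71, in line with the values quoted above, so seller 1 is the better bet under this model.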


Bayesian model selection 


In Figure 1.18, we saw that using too high a degree polynomial results in overfitting, and using 
too low a degree results in underfitting. Similarly, in Figure 7.8(a), we saw that using too small 
a regularization parameter results in overfitting, and too large a value results in underfitting. In
general, when faced with a set of models (i.e., families of parametric distributions) of different
complexity, how should we choose the best one? This is called the model selection problem.

One approach is to use cross-validation to estimate the generalization error of all the candidate
models, and then to pick the model that seems the best. However, this requires fitting each
model K times, where K is the number of CV folds. A more efficient approach is to compute
the posterior over models,

$$p(m|\mathcal{D}) = \frac{p(\mathcal{D}|m)p(m)}{\sum_{m \in \mathcal{M}} p(m, \mathcal{D})} \tag{5.12}$$

From this, we can easily compute the MAP model, $\hat{m} = \arg\max p(m|\mathcal{D})$. This is called
Bayesian model selection.

If we use a uniform prior over models, $p(m) \propto 1$, this amounts to picking the model which
maximizes

$$p(\mathcal{D}|m) = \int p(\mathcal{D}|\theta)p(\theta|m)\,d\theta \tag{5.13}$$



This quantity is called the marginal likelihood, the integrated likelihood, or the evidence for 
model m. The details on how to perform this integral will be discussed in Section 5.3.2. But 
first we give an intuitive interpretation of what this quantity means. 


Bayesian Occam’s razor 


One might think that using p(D|m) to select models would always favor the model with the 
most parameters. This is true if we use $p(\mathcal{D}|\hat{\theta}_m)$ to select models, where $\hat{\theta}_m$ is the MLE or
MAP estimate of the parameters for model m, because models with more parameters will fit the 
data better, and hence achieve higher likelihood. However, if we integrate out the parameters, 
rather than maximizing them, we are automatically protected from overfitting: models with 
more parameters do not necessarily have higher marginal likelihood. This is called the Bayesian 
Occam’s razor effect (MacKay 1995b; Murray and Ghahramani 2005), named after the principle 
known as Occam’s razor, which says one should pick the simplest model that adequately 
explains the data. 

One way to understand the Bayesian Occam’s razor is to notice that the marginal likelihood 
can be rewritten as follows, based on the chain rule of probability (Equation 2.5): 


$$p(\mathcal{D}) = p(y_1)p(y_2|y_1)p(y_3|y_{1:2}) \cdots p(y_N|y_{1:N-1}) \tag{5.14}$$


where we have dropped the conditioning on x for brevity. This is similar to a leave-one-out 
cross-validation estimate (Section 1.4.8) of the likelihood, since we predict each future point given 
all the previous ones. (Of course, the order of the data does not matter in the above expression.) 
If a model is too complex, it will overfit the “early” examples and will then predict the remaining 
ones poorly. 
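
The chain-rule decomposition in Equation 5.14 can be made concrete with a model whose one-step-ahead predictive is available in closed form. The sketch below assumes a Bernoulli likelihood with a Beta(a, b) prior (an illustrative choice, as are the data and function names). It accumulates the log predictive probabilities, confirms that the order of the data does not matter, and compares against the conjugate result $B(a + N_1, b + N_0)/B(a, b)$ for an ordered sequence (no binomial coefficient).

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_sequential(ys, a=1.0, b=1.0):
    """Sum of log p(y_n | y_{1:n-1}) for a Bernoulli likelihood with a Beta(a, b) prior."""
    n1 = n0 = 0
    total = 0.0
    for y in ys:
        p_one = (a + n1) / (a + b + n1 + n0)      # posterior predictive p(y = 1 | past)
        total += math.log(p_one if y == 1 else 1.0 - p_one)
        n1 += y
        n0 += 1 - y
    return total

def log_marginal_closed_form(ys, a=1.0, b=1.0):
    n1 = sum(ys)
    n0 = len(ys) - n1
    return log_beta(a + n1, b + n0) - log_beta(a, b)

ys = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]               # made-up coin tosses
forward = log_marginal_sequential(ys)
backward = log_marginal_sequential(ys[::-1])
closed = log_marginal_closed_form(ys)

print(forward, backward, closed)
```

All three numbers agree up to rounding error, which illustrates both the chain rule and its invariance to the ordering of the data.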

Another way to understand the Bayesian Occam’s razor effect is to note that probabilities must
sum to one. Hence $\sum_{\mathcal{D}'} p(\mathcal{D}'|m) = 1$, where the sum is over all possible data sets. Complex
models, which can predict many things, must spread their probability mass thinly, and hence
will not obtain as large a probability for any given data set as simpler models. This is sometimes
called the conservation of probability mass principle, and is illustrated in Figure 5.6. On the
horizontal axis we plot all possible data sets in order of increasing complexity (measured in
some abstract sense). On the vertical axis we plot the predictions of 3 possible models: a simple
one, $M_1$; a medium one, $M_2$; and a complex one, $M_3$. We also indicate the actually observed
data $\mathcal{D}_0$ by a vertical line. Model 1 is too simple and assigns low probability to $\mathcal{D}_0$. Model 3
also assigns $\mathcal{D}_0$ relatively low probability, because it can predict many data sets, and hence it
spreads its probability quite widely and thinly. Model 2 is “just right”: it predicts the observed
data with a reasonable degree of confidence, but does not predict too many other things. Hence
model 2 is the most probable model.


Figure 5.6 A schematic illustration of the Bayesian Occam's razor. The broad (green) curve corresponds
to a complex model, the narrow (blue) curve to a simple model, and the middle (red) curve is just right.
Based on Figure 3.13 of (Bishop 2006a). See also (Murray and Ghahramani 2005, Figure 2) for a similar plot
produced on real data.
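
A toy version of the conservation of probability mass argument can be computed exactly. The two coin models below are hypothetical and are not the models behind Figure 5.6: a "simple" model that fixes $\theta = 0.5$, and a "flexible" model that places a uniform prior on $\theta$ and integrates it out. Each data set is summarized by its number of heads in N = 10 tosses.

```python
from math import comb

N = 10
# Simple model: a fair coin, theta fixed at 0.5.
p_simple = [comb(N, k) * 0.5 ** N for k in range(N + 1)]
# Flexible model: theta ~ Uniform(0, 1) integrated out, so every head count is equally likely.
p_flexible = [1.0 / (N + 1) for _ in range(N + 1)]

for k in range(N + 1):
    print(f"heads={k:2d}  simple={p_simple[k]:.4f}  flexible={p_flexible[k]:.4f}")
print("totals:", sum(p_simple), sum(p_flexible))
```

Both columns sum to one. The simple model concentrates its mass on balanced outcomes and wins there, while the flexible model spreads its mass evenly and only wins when the data are extreme.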

As a concrete example of the Bayesian Occam’s razor, consider the data in Figure 5.7. We plot 
polynomials of degrees 1, 2 and 3 fit to N = 5 data points. The figure also shows the posterior over
models, where we use a Gaussian prior (see Section 7.6 for details). There is not enough data
to justify a complex model, so the MAP model is d = 1. Figure 5.8 shows what happens when 
N = 30. Now it is clear that d = 2 is the right model (the data was in fact generated from a 
quadratic). 

As another example, Figure 7.8(c) plots $\log p(\mathcal{D}|\lambda)$ vs $\log(\lambda)$, for the polynomial ridge regression
model, where $\lambda$ ranges over the same set of values used in the CV experiment. We see
that the maximum evidence occurs at roughly the same point as the minimum of the test MSE,
which also corresponds to the point chosen by CV.

When using the Bayesian approach, we are not restricted to evaluating the evidence at a
finite grid of values. Instead, we can use numerical optimization to find $\lambda^* = \arg\max_\lambda p(\mathcal{D}|\lambda)$.
This technique is called empirical Bayes or type II maximum likelihood (see Section 5.6 for 
details). An example is shown in Figure 7.8(b): we see that the curve has a similar shape to the 
CV estimate, but it can be computed more efficiently. 
Figure 5.7 (a-c) We plot polynomials of degrees 1, 2 and 3 fit to N = 5 data points using empirical
Bayes. The solid green curve is the true function, the dashed red curve is the prediction (dotted blue lines
represent $\pm\sigma$ around the mean). (d) We plot the posterior over models, $p(d|\mathcal{D})$, assuming a uniform prior
$p(d) \propto 1$. Based on a figure by Zoubin Ghahramani. Figure generated by linregEbModelSelVsN.


Computing the marginal likelihood (evidence) 
When discussing parameter inference for a fixed model, we often wrote

$$p(\theta|D, m) \propto p(\theta|m)\, p(D|\theta, m) \tag{5.15}$$

thus ignoring the normalization constant $p(D|m)$. This is valid since $p(D|m)$ is constant wrt $\theta$.
However, when comparing models, we need to know how to compute the marginal likelihood,
$p(D|m)$. In general, this can be quite hard, since we have to integrate over all possible parameter
values, but when we have a conjugate prior, it is easy to compute, as we now show. 

Figure 5.8 Same as Figure 5.7 except now N = 30. Figure generated by linregEbModelSelVsN.

Let $p(\theta) = q(\theta)/Z_0$ be our prior, where $q(\theta)$ is an unnormalized distribution, and $Z_0$ is
the normalization constant of the prior. Let $p(D|\theta) = q(D|\theta)/Z_\ell$ be the likelihood, where $Z_\ell$
contains any constant factors in the likelihood. Finally let $p(\theta|D) = q(\theta|D)/Z_N$ be our posterior,
where $q(\theta|D) = q(D|\theta)\, q(\theta)$ is the unnormalized posterior, and $Z_N$ is the normalization
constant of the posterior. We have

$$p(\theta|D) = \frac{p(D|\theta)\, p(\theta)}{p(D)} \tag{5.16}$$
$$\frac{q(\theta|D)}{Z_N} = \frac{q(D|\theta)\, q(\theta)}{Z_\ell Z_0\, p(D)} \tag{5.17}$$
$$p(D) = \frac{Z_N}{Z_0 Z_\ell} \tag{5.18}$$


So assuming the relevant normalization constants are tractable, we have an easy way to compute 
the marginal likelihood. We give some examples below.

Beta-binomial model


Let us apply the above result to the Beta-binomial model. Since we know $p(\theta|D) = \mathrm{Beta}(\theta|a', b')$,
where $a' = a + N_1$ and $b' = b + N_0$, we know the normalization constant of the posterior is
$B(a', b')$. Hence

$$p(\theta|D) = \frac{p(D|\theta)\, p(\theta)}{p(D)} \tag{5.19}$$
$$= \frac{1}{p(D)} \left[ \frac{1}{B(a,b)} \theta^{a-1} (1-\theta)^{b-1} \right] \left[ \binom{N}{N_1} \theta^{N_1} (1-\theta)^{N_0} \right] \tag{5.20}$$
$$= \binom{N}{N_1} \frac{1}{p(D)} \frac{1}{B(a,b)} \left[ \theta^{a+N_1-1} (1-\theta)^{b+N_0-1} \right] \tag{5.21}$$

So

$$\frac{1}{B(a+N_1, b+N_0)} = \binom{N}{N_1} \frac{1}{p(D)} \frac{1}{B(a,b)} \tag{5.22}$$
$$p(D) = \binom{N}{N_1} \frac{B(a+N_1, b+N_0)}{B(a,b)} \tag{5.23}$$

The marginal likelihood for the Beta-Bernoulli model is the same as above, except it is missing
the $\binom{N}{N_1}$ term.
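
As a rough numerical check of Equation 5.23, the following sketch evaluates the log marginal likelihood with log-gamma functions. The function names and the `include_binomial` flag are illustrative choices, not part of the book's code.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marglik_beta_binomial(n1, n0, a, b, include_binomial=True):
    """Log of Equation 5.23; set include_binomial=False for the Beta-Bernoulli case."""
    out = log_beta(a + n1, b + n0) - log_beta(a, b)
    if include_binomial:
        out += math.log(math.comb(n1 + n0, n1))
    return out


# Under a uniform Beta(1, 1) prior, every head count 0..N is equally probable a priori.
probs = [math.exp(log_marglik_beta_binomial(n1, 5 - n1, 1.0, 1.0)) for n1 in range(6)]
```

Working in log space avoids overflow in the Beta functions when the counts are large.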
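
Dirichlet-multinoulli model


By the same reasoning as the Beta-Bernoulli case, one can show that the marginal likelihood for
the Dirichlet-multinoulli model is given by

$$p(D) = \frac{B(\mathbf{N} + \boldsymbol{\alpha})}{B(\boldsymbol{\alpha})} \tag{5.24}$$

where $\mathbf{N} = (N_1, \ldots, N_K)$ is the vector of counts and

$$B(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma\left(\sum_k \alpha_k\right)} \tag{5.25}$$

Hence we can rewrite the above result in the following form, which is what is usually presented
in the literature:

$$p(D) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\Gamma\left(N + \sum_k \alpha_k\right)} \prod_k \frac{\Gamma(N_k + \alpha_k)}{\Gamma(\alpha_k)} \tag{5.26}$$

where $N = \sum_k N_k$.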


We will see many applications of this equation later. 
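
Equation 5.26 translates directly into a few lines of code. The sketch below is a minimal illustration; the function name is hypothetical, and the example values are chosen only so that the result can be checked by hand.

```python
import math


def log_marglik_dirichlet_multinoulli(counts, alpha):
    """Log of Equation 5.26 for counts N_k and Dirichlet parameters alpha_k."""
    n, a0 = sum(counts), sum(alpha)
    out = math.lgamma(a0) - math.lgamma(n + a0)
    for nk, ak in zip(counts, alpha):
        out += math.lgamma(nk + ak) - math.lgamma(ak)
    return out


# K = 2 with a uniform prior: probability of one particular sequence
# containing one head and one tail, i.e. the integral of theta * (1 - theta).
p_ht = math.exp(log_marglik_dirichlet_multinoulli([1, 1], [1.0, 1.0]))
```

For K = 2 this reduces to the Beta-Bernoulli marginal likelihood, which gives a quick sanity check.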


Gaussian-Gaussian-Wishart model 


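Consider the case of an MVN with a conjugate NIW prior. Let $Z_0$ be the normalizer for the
prior, $Z_N$ be the normalizer for the posterior, and let $Z_\ell = (2\pi)^{ND/2}$ be the normalizer for the
likelihood. Then it is easy to see that

$$p(D) = \frac{Z_N}{Z_0 Z_\ell} \tag{5.27}$$
$$= \frac{1}{\pi^{ND/2}} \frac{1}{2^{ND/2}} \frac{\left(\frac{2\pi}{\kappa_N}\right)^{D/2} |S_N|^{-\nu_N/2}\, 2^{(\nu_0+N)D/2}\, \Gamma_D(\nu_N/2)}{\left(\frac{2\pi}{\kappa_0}\right)^{D/2} |S_0|^{-\nu_0/2}\, 2^{\nu_0 D/2}\, \Gamma_D(\nu_0/2)} \tag{5.28}$$
$$= \frac{1}{\pi^{ND/2}} \left(\frac{\kappa_0}{\kappa_N}\right)^{D/2} \frac{|S_0|^{\nu_0/2}}{|S_N|^{\nu_N/2}} \frac{\Gamma_D(\nu_N/2)}{\Gamma_D(\nu_0/2)} \tag{5.29}$$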


This equation will prove useful later. 


BIC approximation to log marginal likelihood 


In general, computing the integral in Equation 5.13 can be quite difficult. One simple but popular
approximation is known as the Bayesian information criterion or BIC, which has the following
form (Schwarz 1978):

$$\mathrm{BIC} \triangleq \log p(D|\hat{\theta}) - \frac{\mathrm{dof}(\hat{\theta})}{2} \log N \approx \log p(D) \tag{5.30}$$

where $\mathrm{dof}(\hat{\theta})$ is the number of degrees of freedom in the model, and $\hat{\theta}$ is the MLE for the
model (see footnote 2). We see that this has the form of a penalized log likelihood, where the penalty term
depends on the model's complexity. See Section 8.4.2 for the derivation of the BIC score.

As an example, consider linear regression. As we show in Section 7.3, the MLE is given by
$\hat{w} = (X^T X)^{-1} X^T y$ and $\hat{\sigma}^2 = \mathrm{RSS}/N$, where $\mathrm{RSS} = \sum_{i=1}^N (y_i - \hat{w}^T x_i)^2$. The corresponding
log likelihood is given by

$$\log p(D|\hat{\theta}) = -\frac{N}{2} \log(2\pi\hat{\sigma}^2) - \frac{N}{2} \tag{5.31}$$

Hence the BIC score is as follows (dropping constant terms)

$$\mathrm{BIC} = -\frac{N}{2} \log(\hat{\sigma}^2) - \frac{D}{2} \log(N) \tag{5.32}$$

where D is the number of variables in the model. In the statistics literature, it is common to
use an alternative definition of BIC, which we call the BIC cost (since we want to minimize it):

$$\text{BIC-cost} \triangleq -2 \log p(D|\hat{\theta}) + \mathrm{dof}(\hat{\theta}) \log N \approx -2 \log p(D) \tag{5.33}$$

In the context of linear regression, this becomes

$$\text{BIC-cost} = N \log(\hat{\sigma}^2) + D \log(N) \tag{5.34}$$
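
To make Equations 5.32 and 5.34 concrete, here is a small sketch that scores two linear regression fits from their residual sums of squares. The RSS values and model sizes are made-up numbers for illustration only, and the function names are not from the book.

```python
import math


def bic_score(rss, n, d):
    """Equation 5.32: BIC for linear regression, constants dropped (higher is better)."""
    sigma2_hat = rss / n
    return -0.5 * n * math.log(sigma2_hat) - 0.5 * d * math.log(n)


def bic_cost(rss, n, d):
    """Equation 5.34: the same quantity as a cost (lower is better)."""
    return n * math.log(rss / n) + d * math.log(n)


# Hypothetical fits to the same N = 30 points: the larger model reduces RSS only slightly.
score_small = bic_score(rss=12.0, n=30, d=2)
score_large = bic_score(rss=11.5, n=30, d=5)
```

Here the small drop in RSS does not pay for three extra parameters, so the smaller model gets the higher BIC score.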


2. Traditionally the BIC score is defined using the ML estimate $\hat{\theta}$, so it is independent of the prior. However, for models
such as mixtures of Gaussians, the ML estimate can be poorly behaved, so it is better to evaluate the BIC score using
the MAP estimate, as in (Fraley and Raftery 2007).

The BIC method is very closely related to the minimum description length or MDL principle,
which characterizes the score for a model in terms of how well it fits the data, minus how
complex the model is to define. See (Hansen and Yu 2001) for details.

There is a very similar expression to BIC/MDL called the Akaike information criterion or
AIC, defined as

$$\mathrm{AIC}(m, D) \triangleq \log p(D|\hat{\theta}_{MLE}) - \mathrm{dof}(m) \tag{5.35}$$


This is derived from a frequentist framework, and cannot be interpreted as an approximation
to the marginal likelihood. Nevertheless, the form of this expression is very similar to BIC. We
see that the penalty for AIC, dof(m), is less than the BIC penalty, (dof/2) log N, whenever
log N > 2, i.e., for N ≥ 8. This causes AIC to pick more complex models.
However, this can result in better predictive accuracy. See e.g., (Clarke et al. 2009, sec 10.2) for
further discussion on such information criteria.


Effect of the prior 


Sometimes it is not clear how to set the prior. When we are performing posterior inference, the 
details of the prior may not matter too much, since the likelihood often overwhelms the prior 
anyway. But when computing the marginal likelihood, the prior plays a much more important 
role, since we are averaging the likelihood over all possible parameter settings, as weighted by 
the prior. 

In Figures 5.7 and 5.8, where we demonstrated model selection for linear regression, we used 
a prior of the form $p(w) = \mathcal{N}(0, \alpha^{-1} I)$. Here $\alpha$ is a tuning parameter that controls how strong
the prior is. This parameter can have a large effect, as we discuss in Section 7.5. Intuitively, if
$\alpha$ is large, the weights are “forced” to be small, so we need to use a complex model with many
small parameters (e.g., a high degree polynomial) to fit the data. Conversely, if $\alpha$ is small, we
will favor simpler models, since each parameter is “allowed” to vary in magnitude by a lot.

If the prior is unknown, the correct Bayesian procedure is to put a prior on the prior. That is,
we should put a prior on the hyper-parameter $\alpha$ as well as the parameters w. To compute the
marginal likelihood, we should integrate out all unknowns, i.e., we should compute


$$p(D|m) = \int\!\!\int p(D|w)\, p(w|\alpha, m)\, p(\alpha|m)\, dw\, d\alpha \tag{5.36}$$


Of course, this requires specifying the hyper-prior. Fortunately, the higher up we go in the 
Bayesian hierarchy, the less sensitive are the results to the prior settings. So we can usually 
make the hyper-prior uninformative. 

A computational shortcut is to optimize $\alpha$ rather than integrating it out. That is, we use

$$p(D|m) \approx \int p(D|w)\, p(w|\hat{\alpha}, m)\, dw \tag{5.37}$$

where

$$\hat{\alpha} = \arg\max_\alpha p(D|\alpha, m) = \arg\max_\alpha \int p(D|w)\, p(w|\alpha, m)\, dw \tag{5.38}$$

This approach is called empirical Bayes (EB), and is discussed in more detail in Section 5.6. This
is the method used in Figures 5.7 and 5.8.

Bayes factor BF(1, 0)    Interpretation
BF < 1/100               Decisive evidence for $M_0$
BF < 1/10                Strong evidence for $M_0$
1/10 < BF < 1/3          Moderate evidence for $M_0$
1/3 < BF < 1             Weak evidence for $M_0$
1 < BF < 3               Weak evidence for $M_1$
3 < BF < 10              Moderate evidence for $M_1$
BF > 10                  Strong evidence for $M_1$
BF > 100                 Decisive evidence for $M_1$


Table 5.1 Jeffreys’ scale of evidence for interpreting Bayes factors.


Bayes factors 


Suppose our prior on models is uniform, $p(m) \propto 1$. Then model selection is equivalent to
picking the model with the highest marginal likelihood. Now suppose we just have two models
we are considering, call them the null hypothesis, $M_0$, and the alternative hypothesis, $M_1$.
Define the Bayes factor as the ratio of marginal likelihoods:

$$BF_{1,0} \triangleq \frac{p(D|M_1)}{p(D|M_0)} = \frac{p(M_1|D)}{p(M_0|D)} \Big/ \frac{p(M_1)}{p(M_0)} \tag{5.39}$$

(This is like a likelihood ratio, except we integrate out the parameters, which allows us to
compare models of different complexity.) If $BF_{1,0} > 1$ then we prefer model 1, otherwise we
prefer model 0.

Of course, it might be that $BF_{1,0}$ is only slightly greater than 1. In that case, we are not
very confident that model 1 is better. Jeffreys (1961) proposed a scale of evidence for interpreting
the magnitude of a Bayes factor, which is shown in Table 5.1. This is a Bayesian alternative to
the frequentist concept of a p-value (see footnote 3). Alternatively, we can just convert the Bayes factor to a
posterior over models. If $p(M_1) = p(M_0) = 0.5$, we have

$$p(M_0|D) = \frac{BF_{0,1}}{1 + BF_{0,1}} = \frac{1}{BF_{1,0} + 1} \tag{5.40}$$
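
The conversion from a Bayes factor to a posterior over models can be sketched as follows. The `prior_m1` argument is an illustrative generalization that follows from rearranging Equation 5.39 (posterior odds equal the Bayes factor times the prior odds); with the default value of 0.5 it reduces to Equation 5.40.

```python
def posterior_m0_from_bf10(bf10, prior_m1=0.5):
    """Posterior probability of M0 given BF(1,0) and the prior probability of M1."""
    prior_odds = prior_m1 / (1.0 - prior_m1)   # p(M1) / p(M0)
    posterior_odds = bf10 * prior_odds         # p(M1|D) / p(M0|D), from Equation 5.39
    return 1.0 / (1.0 + posterior_odds)


# A Bayes factor of 3 (the boundary between "weak" and "moderate" on Jeffreys' scale)
# with equal prior probabilities.
p_m0 = posterior_m0_from_bf10(3.0)
```

This shows that a Bayes factor of 3 still leaves a posterior probability of one quarter on the null model when the two models are equally likely a priori.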


Example: Testing if a coin is fair 


Suppose we observe some coin tosses, and want to decide if the data was generated by a fair
coin, $\theta = 0.5$, or a potentially biased coin, where $\theta$ could be any value in [0,1]. Let us denote
the first model by $M_0$ and the second model by $M_1$. The marginal likelihood under $M_0$ is
simply

$$p(D|M_0) = \left(\frac{1}{2}\right)^N \tag{5.41}$$

where N is the number of coin tosses. The marginal likelihood under $M_1$, using a Beta prior, is

$$p(D|M_1) = \int p(D|\theta)\, p(\theta)\, d\theta = \frac{B(\alpha_1 + N_1, \alpha_0 + N_0)}{B(\alpha_1, \alpha_0)} \tag{5.42}$$

We plot $\log p(D|M_1)$ vs the number of heads $N_1$ in Figure 5.9(a), assuming N = 5 and
$\alpha_1 = \alpha_0 = 1$. (The shape of the curve is not very sensitive to $\alpha_1$ and $\alpha_0$, as long as $\alpha_0 = \alpha_1$.)
If we observe 2 or 3 heads, the unbiased coin hypothesis $M_0$ is more likely than $M_1$, since $M_0$
is a simpler model (it has no free parameters): it would be a suspicious coincidence if the
coin were biased but happened to produce almost exactly 50/50 heads/tails. However, as the
counts become more extreme, we favor the biased coin hypothesis. Note that, if we plot the log
Bayes factor, $\log BF_{1,0}$, it will have exactly the same shape, since $\log p(D|M_0)$ is a constant.
See also Exercise 3.18.

3. A p-value is defined as the probability (under the null hypothesis) of observing some test statistic f(D) (such as the
chi-squared statistic) that is as large or larger than that actually observed, i.e., $\mathrm{pvalue}(D) \triangleq P(f(\tilde{D}) \geq f(D) \mid \tilde{D} \sim H_0)$.
Note that this has almost nothing to do with what we really want to know, which is $p(H_0|D)$.

Figure 5.9 (a) Log marginal likelihood for the coins example. (b) BIC approximation. Figure generated by
coinsModelSelDemo.

Figure 5.9(b) shows the BIC approximation to $\log p(D|M_1)$ for our biased coin example
from Section 5.3.3.1. We see that the curve has approximately the same shape as the exact log
marginal likelihood, which is all that matters for model selection purposes, since the absolute
scale is irrelevant. In particular, it favors the simpler model unless the data is overwhelmingly
in support of the more complex model.
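
The coin example is small enough to reproduce numerically. The sketch below computes the exact log Bayes factor from Equations 5.41 and 5.42 for N = 5 tosses; the function names are illustrative and this is not the book's coinsModelSelDemo code.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_bf10(n1, n0, a1=1.0, a0=1.0):
    """Log Bayes factor of the biased-coin model M1 against the fair-coin model M0."""
    log_m1 = log_beta(a1 + n1, a0 + n0) - log_beta(a1, a0)   # Equation 5.42
    log_m0 = (n1 + n0) * math.log(0.5)                        # Equation 5.41
    return log_m1 - log_m0


# Log Bayes factor for every possible number of heads in N = 5 tosses.
log_bfs = {n1: log_bf10(n1, 5 - n1) for n1 in range(6)}
```

With 2 or 3 heads the log Bayes factor is negative, so the fair coin is preferred, while 0 or 5 heads give a clearly positive value.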


Jeffreys-Lindley paradox * 


Problems can arise when we use improper priors (i.e., priors that do not integrate to 1) for model
selection/hypothesis testing, even though such priors may be acceptable for other purposes. For
example, consider testing the hypotheses $M_0 : \theta \in \Theta_0$ vs $M_1 : \theta \in \Theta_1$. To define the marginal
density on $\theta$, we use the following mixture model

$$p(\theta) = p(\theta|M_0)\, p(M_0) + p(\theta|M_1)\, p(M_1) \tag{5.43}$$

This is only meaningful if $p(\theta|M_0)$ and $p(\theta|M_1)$ are proper (normalized) density functions. In
this case, the posterior is given by

$$p(M_0|D) = \frac{p(M_0)\, p(D|M_0)}{p(M_0)\, p(D|M_0) + p(M_1)\, p(D|M_1)} \tag{5.44}$$
$$= \frac{p(M_0) \int_{\Theta_0} p(D|\theta)\, p(\theta|M_0)\, d\theta}{p(M_0) \int_{\Theta_0} p(D|\theta)\, p(\theta|M_0)\, d\theta + p(M_1) \int_{\Theta_1} p(D|\theta)\, p(\theta|M_1)\, d\theta} \tag{5.45}$$

Now suppose we use improper priors, $p(\theta|M_0) \propto c_0$ and $p(\theta|M_1) \propto c_1$. Then

$$p(M_0|D) = \frac{p(M_0)\, c_0 \int_{\Theta_0} p(D|\theta)\, d\theta}{p(M_0)\, c_0 \int_{\Theta_0} p(D|\theta)\, d\theta + p(M_1)\, c_1 \int_{\Theta_1} p(D|\theta)\, d\theta} \tag{5.46}$$
$$= \frac{p(M_0)\, c_0 \ell_0}{p(M_0)\, c_0 \ell_0 + p(M_1)\, c_1 \ell_1} \tag{5.47}$$

where $\ell_i = \int_{\Theta_i} p(D|\theta)\, d\theta$ is the integrated or marginal likelihood for model i. Now let $p(M_0) =
p(M_1) = \frac{1}{2}$. Hence

$$p(M_0|D) = \frac{c_0 \ell_0}{c_0 \ell_0 + c_1 \ell_1} = \frac{\ell_0}{\ell_0 + (c_1/c_0)\, \ell_1} \tag{5.48}$$

Thus we can change the posterior arbitrarily by choosing $c_1$ and $c_0$ as we please. Note that
using proper, but very vague, priors can cause similar problems. In particular, the Bayes factor 
will always favor the simpler model, since the probability of the observed data under a complex 
model with a very diffuse prior will be very small. This is called the Jeffreys-Lindley paradox. 

Thus it is important to use proper priors when performing model selection. Note, however, 
that, if $M_0$ and $M_1$ share the same prior over a subset of the parameters, this part of the prior
can be improper, since the corresponding normalization constant will cancel out. 


Priors 


The most controversial aspect of Bayesian statistics is its reliance on priors. Bayesians argue 
this is unavoidable, since nobody is a tabula rasa or blank slate: all inference must be done 
conditional on certain assumptions about the world. Nevertheless, one might be interested in 
minimizing the impact of one’s prior assumptions. We briefly discuss some ways to do this 
below. 


Uninformative priors 


If we don't have strong beliefs about what $\theta$ should be, it is common to use an uninformative
or non-informative prior, and to “let the data speak for itself”.

The issue of designing uninformative priors is actually somewhat tricky. As an example
of the difficulty, consider a Bernoulli parameter, $\theta \in [0,1]$. One might think that the most
uninformative prior would be the uniform distribution, Beta(1, 1). But the posterior mean in
this case is $E[\theta|D] = \frac{N_1 + 1}{N_1 + N_0 + 2}$, whereas the MLE is $\frac{N_1}{N_1 + N_0}$. Hence one could argue that the
prior wasn't completely uninformative after all.

Clearly by decreasing the magnitude of the pseudo counts, we can lessen the impact of the
prior. By the above argument, the most non-informative prior is

$$\lim_{c \to 0} \mathrm{Beta}(c, c) = \mathrm{Beta}(0, 0) \tag{5.49}$$

which is a mixture of two equal point masses at 0 and 1 (see (Zhu and Lu 2004)). This is also
called the Haldane prior. Note that the Haldane prior is an improper prior, meaning it does not
integrate to 1. However, as long as we see at least one head and at least one tail, the posterior
will be proper.

In Section 5.4.2.1 we will argue that the “right” uninformative prior is in fact $\mathrm{Beta}(\frac{1}{2}, \frac{1}{2})$.
Clearly the difference in practice between these three priors is very likely negligible. In general, 
it is advisable to perform some kind of sensitivity analysis, in which one checks how much 
one’s conclusions or predictions change in response to change in the modeling assumptions, 
which includes the choice of prior, but also the choice of likelihood and any kind of data pre- 
processing. If the conclusions are relatively insensitive to the modeling assumptions, one can 
have more confidence in the results. 
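
A very small sensitivity analysis for the three candidate priors can be sketched as follows. The counts are hypothetical, and the posterior mean formula is the standard Beta-Bernoulli result (pseudo counts added to the observed counts).

```python
def posterior_mean(n1, n0, a, b):
    """Posterior mean of theta under a Beta(a, b) prior with n1 heads and n0 tails."""
    return (n1 + a) / (n1 + n0 + a + b)


PRIORS = {
    "uniform Beta(1, 1)": (1.0, 1.0),
    "Jeffreys Beta(1/2, 1/2)": (0.5, 0.5),
    "Haldane Beta(0, 0)": (0.0, 0.0),   # improper; needs at least one head and one tail
}


def sensitivity(n1, n0):
    """Posterior mean under each candidate prior, for the same data."""
    return {name: posterior_mean(n1, n0, a, b) for name, (a, b) in PRIORS.items()}


small = sensitivity(3, 1)        # hypothetical small data set
large = sensitivity(300, 100)    # same proportions, 100 times more data
spread_small = max(small.values()) - min(small.values())
spread_large = max(large.values()) - min(large.values())
```

With only four observations the three posterior means differ noticeably, whereas with four hundred they agree to about three decimal places, which is the sense in which the choice among these priors matters little once there is a reasonable amount of data.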


Jeffreys priors * 


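Harold Jeffreys (see footnote 4) designed a general purpose technique for creating non-informative priors. The
result is known as the Jeffreys prior. The key observation is that if $p(\phi)$ is non-informative,
then any re-parameterization of the prior, such as $\theta = h(\phi)$ for some function h, should also
be non-informative. Now, by the change of variables formula,

$$p_\theta(\theta) = p_\phi(\phi) \left| \frac{d\phi}{d\theta} \right| \tag{5.50}$$

so the prior will in general change. However, let us pick

$$p_\phi(\phi) \propto (I(\phi))^{\frac{1}{2}} \tag{5.51}$$

where $I(\phi)$ is the Fisher information:

$$I(\phi) \triangleq E\left[ \left( \frac{d \log p(X|\phi)}{d\phi} \right)^2 \right] = -E\left[ \frac{d^2 \log p(X|\phi)}{d\phi^2} \right] \tag{5.52}$$

This is a measure of curvature of the expected negative log likelihood and hence a measure of
stability of the MLE (see Section 6.2.2). Now

$$\frac{d \log p(x|\theta)}{d\theta} = \frac{d \log p(x|\phi)}{d\phi} \frac{d\phi}{d\theta} \tag{5.53}$$

Squaring and taking expectations over x, we have

$$I(\theta) = E\left[ \left( \frac{d \log p(X|\theta)}{d\theta} \right)^2 \right] = I(\phi) \left( \frac{d\phi}{d\theta} \right)^2 \tag{5.54}$$
$$I(\theta)^{\frac{1}{2}} = I(\phi)^{\frac{1}{2}} \left| \frac{d\phi}{d\theta} \right| \tag{5.55}$$

so we find the transformed prior is

$$p_\theta(\theta) = p_\phi(\phi) \left| \frac{d\phi}{d\theta} \right| \propto (I(\phi))^{\frac{1}{2}} \left| \frac{d\phi}{d\theta} \right| = I(\theta)^{\frac{1}{2}} \tag{5.56}$$

So $p_\theta(\theta)$ and $p_\phi(\phi)$ are the same.

Some examples will make this clearer.

4. Harold Jeffreys, 1891 - 1989, was an English mathematician, statistician, geophysicist, and astronomer.


Example: Jeffreys prior for the Bernoulli and multinoulli


Suppose $X \sim \mathrm{Ber}(\theta)$. The log likelihood for a single sample is

$$\log p(X|\theta) = X \log \theta + (1 - X) \log(1 - \theta) \tag{5.57}$$

The score function is just the gradient of the log-likelihood:

$$s(\theta) \triangleq \frac{d}{d\theta} \log p(X|\theta) = \frac{X}{\theta} - \frac{1 - X}{1 - \theta} \tag{5.58}$$

The observed information is the negative second derivative of the log-likelihood:

$$J(\theta) = -\frac{d^2}{d\theta^2} \log p(X|\theta) = -s'(\theta|X) = \frac{X}{\theta^2} + \frac{1 - X}{(1 - \theta)^2} \tag{5.59}$$

The Fisher information is the expected information:

$$I(\theta) = E[J(\theta|X) \mid X \sim \theta] = \frac{\theta}{\theta^2} + \frac{1 - \theta}{(1 - \theta)^2} = \frac{1}{\theta(1 - \theta)} \tag{5.60}$$

Hence Jeffreys’ prior is

$$p(\theta) \propto \theta^{-\frac{1}{2}} (1 - \theta)^{-\frac{1}{2}} = \frac{1}{\sqrt{\theta(1 - \theta)}} \propto \mathrm{Beta}(\tfrac{1}{2}, \tfrac{1}{2}) \tag{5.61}$$

Now consider a multinoulli random variable with K states. One can show that the Jeffreys’
prior is given by

$$p(\theta) \propto \mathrm{Dir}(\tfrac{1}{2}, \ldots, \tfrac{1}{2}) \tag{5.62}$$

Note that this is different from the more obvious choices of $\mathrm{Dir}(\frac{1}{K}, \ldots, \frac{1}{K})$ or $\mathrm{Dir}(1, \ldots, 1)$.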


Example: Jeffreys prior for location and scale parameters 


One can show that the Jeffreys prior for a location parameter, such as the Gaussian mean, is
$p(\mu) \propto 1$. This is an example of a translation invariant prior, which satisfies the property
that the probability mass assigned to any interval [A, B] is the same as that assigned to any
other shifted interval of the same width, such as [A − c, B − c]. That is,

$$\int_{A-c}^{B-c} p(\mu)\, d\mu = (B - c) - (A - c) = (B - A) = \int_A^B p(\mu)\, d\mu \tag{5.63}$$

This can be achieved using $p(\mu) \propto 1$, which we can approximate by using a Gaussian with
infinite variance, $p(\mu) = \mathcal{N}(\mu|0, \infty)$. Note that this is an improper prior, since it does not
integrate to 1. Using improper priors is fine as long as the posterior is proper, which will be the
case provided we have seen N ≥ 1 data points, since we can “nail down” the location as soon
as we have seen a single data point.

Similarly, one can show that the Jeffreys prior for a scale parameter, such as the Gaussian
variance, is $p(\sigma^2) \propto 1/\sigma^2$. This is an example of a scale invariant prior, which satisfies the
property that the probability mass assigned to any interval [A, B] is the same as that assigned
to any other interval [A/c, B/c] which is scaled in size by some constant factor c > 0. (For
example, if we change units from meters to feet we do not want that to affect our inferences.)
This can be achieved by using

$$p(s) \propto 1/s \tag{5.64}$$

To see this, note that

$$\int_{A/c}^{B/c} p(s)\, ds = [\log s]_{A/c}^{B/c} = \log(B/c) - \log(A/c) \tag{5.65}$$
$$= \log(B) - \log(A) = \int_A^B p(s)\, ds \tag{5.66}$$

We can approximate this using a degenerate Gamma distribution (Section 2.4.4), $p(s) = \mathrm{Ga}(s|0, 0)$.
The prior $p(s) \propto 1/s$ is also improper, but the posterior is proper as soon as we have seen
N ≥ 2 data points (since we need at least two data points to estimate a variance).


Robust priors 


In many cases, we are not very confident in our prior, so we want to make sure it does not have 
an undue influence on the result. This can be done by using robust priors (Insua and Ruggeri 
2000), which typically have heavy tails, which avoids forcing things to be too close to the prior 
mean. 

Let us consider an example from (Berger 1985, p7). Suppose $x \sim \mathcal{N}(\theta, 1)$. We observe that
x = 5 and we want to estimate $\theta$. The MLE is of course $\hat{\theta} = 5$, which seems reasonable. The
posterior mean under a uniform prior is also $\bar{\theta} = 5$. But now suppose we know that the prior
median is 0, and the prior quartiles are at −1 and 1, so $p(\theta \leq -1) = p(-1 < \theta \leq 0) = p(0 <
\theta \leq 1) = p(1 < \theta) = 0.25$. Let us also assume the prior is smooth and unimodal.

It is easy to show that a Gaussian prior of the form $\mathcal{N}(\theta|0, 2.19)$, i.e., with variance 2.19
(standard deviation about 1.48), satisfies these prior constraints. But in this case the posterior
mean is given by 3.43, which doesn’t seem very satisfactory.

Now suppose we use a Cauchy prior $\mathcal{T}(\theta|0, 1, 1)$. This also satisfies the prior constraints of
our example. But this time we find (using numerical integration: see robustPriorDemo
for the code) that the posterior mean is about 4.6, which seems much more reasonable.
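
The two posterior means can be approximated with a simple grid-based integration, as in the sketch below. This is not the book's robustPriorDemo; the grid limits, step count and function names are arbitrary illustrative choices, and the Gaussian prior is assumed to have variance 2.19, which is the value that puts the prior quartiles at ±1 and reproduces the posterior mean of 3.43.

```python
import math


def posterior_mean(x, log_prior, lo=-30.0, hi=30.0, steps=60001):
    """Posterior mean of theta for x ~ N(theta, 1), via a Riemann sum on a uniform grid."""
    h = (hi - lo) / (steps - 1)
    num = den = 0.0
    for i in range(steps):
        theta = lo + i * h
        # Unnormalized posterior: likelihood times prior (constants cancel in the ratio).
        w = math.exp(-0.5 * (x - theta) ** 2 + log_prior(theta))
        num += theta * w
        den += w
    return num / den


def log_gauss_prior(theta):
    """Log of N(theta | 0, variance 2.19), up to an additive constant."""
    return -0.5 * theta * theta / 2.19


def log_cauchy_prior(theta):
    """Log of the standard Cauchy density, up to an additive constant."""
    return -math.log1p(theta * theta)


mean_gauss = posterior_mean(5.0, log_gauss_prior)
mean_cauchy = posterior_mean(5.0, log_cauchy_prior)
```

The heavy tails of the Cauchy prior let the single observation at x = 5 pull the posterior mean much closer to the data than the Gaussian prior allows.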


Mixtures of conjugate priors 


Robust priors are useful, but can be computationally expensive to use. Conjugate priors simplify
the computation, but are often not robust, and not flexible enough to encode our prior knowledge.
However, it turns out that a mixture of conjugate priors is also conjugate (Exercise 5.1),
and can approximate any kind of prior (Dallal and Hall 1983; Diaconis and Ylvisaker 1985). Thus
such priors provide a good compromise between computational convenience and flexibility.

For example, suppose we are modeling coin tosses, and we think the coin is either fair, or 
is biased towards heads. This cannot be represented by a beta distribution. However, we can 
model it using a mixture of two beta distributions. For example, we might use 


$$p(\theta) = 0.5\, \mathrm{Beta}(\theta|20, 20) + 0.5\, \mathrm{Beta}(\theta|30, 10) \tag{5.67}$$


If $\theta$ comes from the first distribution, the coin is fair, but if it comes from the second, it is
biased towards heads.

We can represent a mixture by introducing a latent indicator variable z, where z = k means
that $\theta$ comes from mixture component k. The prior has the form

$$p(\theta) = \sum_k p(z = k)\, p(\theta|z = k) \tag{5.68}$$

where each $p(\theta|z = k)$ is conjugate, and $p(z = k)$ are called the (prior) mixing weights. One can
show (Exercise 5.1) that the posterior can also be written as a mixture of conjugate distributions
as follows:

$$p(\theta|D) = \sum_k p(z = k|D)\, p(\theta|D, z = k) \tag{5.69}$$

where $p(z = k|D)$ are the posterior mixing weights given by

$$p(z = k|D) = \frac{p(z = k)\, p(D|z = k)}{\sum_{k'} p(z = k')\, p(D|z = k')} \tag{5.70}$$

Here the quantity $p(D|z = k)$ is the marginal likelihood for mixture component k (see Section 5.3.2.1).


Example 
Suppose we use the mixture prior

$$p(\theta) = 0.5\, \mathrm{Beta}(\theta|a_1, b_1) + 0.5\, \mathrm{Beta}(\theta|a_2, b_2) \tag{5.71}$$

where $a_1 = b_1 = 20$, $a_2 = 30$ and $b_2 = 10$, and we observe $N_1$ heads and $N_0$ tails. The posterior
becomes

$$p(\theta|D) = p(z = 1|D)\, \mathrm{Beta}(\theta|a_1 + N_1, b_1 + N_0) + p(z = 2|D)\, \mathrm{Beta}(\theta|a_2 + N_1, b_2 + N_0) \tag{5.72}$$

If $N_1 = 20$ heads and $N_0 = 10$ tails, then, using Equation 5.23, the posterior becomes

$$p(\theta|D) = 0.346\, \mathrm{Beta}(\theta|40, 30) + 0.654\, \mathrm{Beta}(\theta|50, 20) \tag{5.73}$$
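
The posterior mixing weights in this example can be reproduced with a few lines of code. The sketch below assumes the two prior components are Beta(20, 20) and Beta(30, 10), matching the posterior components Beta(40, 30) and Beta(50, 20); the function names are illustrative and this is not the book's mixBetaDemo.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def posterior_mixing_weights(components, weights, n1, n0):
    """Posterior mixing weights p(z = k | D) for a mixture of Beta(a, b) priors."""
    # Per-component log marginal likelihood; the binomial coefficient cancels in the ratio.
    log_post = [math.log(w) + log_beta(a + n1, b + n0) - log_beta(a, b)
                for w, (a, b) in zip(weights, components)]
    m = max(log_post)                      # subtract the max for numerical stability
    unnorm = [math.exp(lp - m) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


post_w = posterior_mixing_weights([(20, 20), (30, 10)], [0.5, 0.5], n1=20, n0=10)
```

Observing twice as many heads as tails shifts weight from the "fair" component towards the "biased towards heads" component.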


See Figure 5.10 for an illustration.

Figure 5.10 A mixture of two Beta distributions. Figure generated by mixBetaDemo.


Application: Finding conserved regions in DNA and protein sequences 


We mentioned that Dirichlet-multinomial models are widely used in biosequence analysis. Let 
us give a simple example to illustrate some of the machinery that has developed. Specifically, 
consider the sequence logo discussed in Section 2.3.2.1. Now suppose we want to find locations 
which represent coding regions of the genome. Such locations often have the same letter across 
all sequences, because of evolutionary pressure. So we need to find columns which are “pure”, 
or nearly so, in the sense that they are mostly all As, mostly all Ts, mostly all Cs, or mostly all 
Gs. One approach is to look for low-entropy columns; these will be ones whose distribution is 
nearly deterministic (pure). 

But suppose we want to associate a confidence measure with our estimates of purity. This
can be useful if we believe adjacent locations are conserved together. In this case, we can let
$Z_t = 1$ if location t is conserved, and let $Z_t = 0$ otherwise. We can then add a dependence
between adjacent $Z_t$ variables using a Markov chain; see Chapter 17 for details.

In any case, we need to define a likelihood model, $p(N_t|Z_t)$, where $N_t$ is the vector of
(A,C,G,T) counts for column t. It is natural to make this be a multinomial distribution with
parameter $\theta_t$. Since each column has a different distribution, we will want to integrate out $\theta_t$
and thus compute the marginal likelihood

$$p(N_t|Z_t) = \int p(N_t|\theta_t)\, p(\theta_t|Z_t)\, d\theta_t \tag{5.74}$$

But what prior should we use for $\theta_t$? When $Z_t = 0$ we can use a uniform prior, $p(\theta|Z_t = 0) =
\mathrm{Dir}(1, 1, 1, 1)$, but what should we use if $Z_t = 1$? After all, if the column is conserved, it could
be a (nearly) pure column of As, Cs, Gs, or Ts. A natural approach is to use a mixture of Dirichlet
priors, each one of which is “tilted” towards the appropriate corner of the 4-dimensional simplex,
e.g.,

$$p(\theta|Z_t = 1) = \frac{1}{4} \mathrm{Dir}(\theta|(10, 1, 1, 1)) + \cdots + \frac{1}{4} \mathrm{Dir}(\theta|(1, 1, 1, 10)) \tag{5.75}$$
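
Under such a mixture prior, the evidence for a column is a weighted sum of Dirichlet-multinoulli marginal likelihoods (Equation 5.26), one per mixture component. The sketch below compares the evidence under the "conserved" and "background" priors; the function names and the example count vectors are illustrative, and the multinomial coefficient is omitted because it is the same under both hypotheses and cancels in the ratio.

```python
import math


def log_marglik(counts, alpha):
    """Log Dirichlet-multinoulli evidence (Equation 5.26)."""
    out = math.lgamma(sum(alpha)) - math.lgamma(sum(counts) + sum(alpha))
    for nk, ak in zip(counts, alpha):
        out += math.lgamma(nk + ak) - math.lgamma(ak)
    return out


def log_evidence_conserved(counts, strength=10.0):
    """Evidence under an equal-weight mixture of four tilted Dirichlets (Equation 5.75)."""
    logs = []
    for k in range(4):
        alpha = [1.0] * 4
        alpha[k] = strength                # tilt towards letter k
        logs.append(math.log(0.25) + log_marglik(counts, alpha))
    m = max(logs)                          # log-sum-exp over the four components
    return m + math.log(sum(math.exp(v - m) for v in logs))


def log_odds_conserved(counts):
    """log p(N_t | Z_t = 1) - log p(N_t | Z_t = 0) for a vector of (A, C, G, T) counts."""
    return log_evidence_conserved(counts) - log_marglik(counts, [1.0] * 4)


pure_column = log_odds_conserved([10, 0, 0, 0])
mixed_column = log_odds_conserved([3, 3, 2, 2])
```

A pure column yields positive log odds in favour of conservation, while a well-mixed column yields negative log odds.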


Since this is conjugate, we can easily compute $p(N_t|Z_t)$. See (Brown et al. 1993) for an
application of these ideas to a real bio-sequence problem.


Hierarchical Bayes 


A key requirement for computing the posterior $p(\theta|D)$ is the specification of a prior $p(\theta|\eta)$,
where $\eta$ are the hyper-parameters. What if we don’t know how to set $\eta$? In some cases, we can
use uninformative priors, as we discussed above. A more Bayesian approach is to put a prior on
our priors! In terms of graphical models (Chapter 10), we can represent the situation as follows:


$$\eta \to \theta \to D \tag{5.76}$$


This is an example of a hierarchical Bayesian model, also called a multi-level model, since 
there are multiple levels of unknown quantities. We give a simple example below, and we will 
see many others later in the book. 


Example: modeling related cancer rates 


Consider the problem of predicting cancer rates in various cities (this example is from (Johnson
and Albert 1999, p24)). In particular, suppose we measure the number of people in various
cities, $N_i$, and the number of people who died of cancer in these cities, $x_i$. We assume
$x_i \sim \mathrm{Bin}(N_i, \theta_i)$, and we want to estimate the cancer rates $\theta_i$. One approach is to estimate
them all separately, but this will suffer from the sparse data problem (underestimation of the
rate of cancer due to small $N_i$). Another approach is to assume all the $\theta_i$ are the same; this is
called parameter tying. The resulting pooled MLE is just $\hat{\theta} = \frac{\sum_i x_i}{\sum_i N_i}$. But the assumption that
all the cities have the same rate is a rather strong one. A compromise approach is to assume
that the $\theta_i$ are similar, but that there may be city-specific variations. This can be modeled by
assuming the $\theta_i$ are drawn from some common distribution, say $\theta_i \sim \mathrm{Beta}(a, b)$. The full joint
distribution can be written as

$$p(D, \theta, \eta|N) = p(\eta) \prod_{i=1}^N \mathrm{Bin}(x_i|N_i, \theta_i)\, \mathrm{Beta}(\theta_i|\eta) \tag{5.77}$$

where $\eta = (a, b)$.

Note that it is crucial that we infer $\eta = (a, b)$ from the data; if we just clamp it to a constant,
the $\theta_i$ will be conditionally independent, and there will be no information flow between them.
By contrast, by treating $\eta$ as an unknown (hidden variable), we allow the data-poor cities to
borrow statistical strength from data-rich ones.

Suppose we compute the joint posterior $p(\eta, \theta|D)$. From this we can get the posterior
marginals $p(\theta_i|D)$. In Figure 5.11(a), we plot the posterior means, $E[\theta_i|D]$, as blue bars, as well
as the population level mean, $E[a/(a + b)|D]$, shown as a red line (this represents the average
of the $\theta_i$’s). We see that the posterior mean is shrunk towards the pooled estimate more strongly
for cities with small sample sizes $N_i$. For example, city 1 and city 20 both have a 0 observed
cancer incidence rate, but city 20 has a smaller population, so its rate is shrunk more towards
the population-level estimate (i.e., it is closer to the horizontal red line) than city 1.

Figure 5.11(b) shows the 95% posterior credible intervals for $\theta_i$. We see that city 15, which has
a very large population (53,637 people), has small posterior uncertainty. Consequently this city
has the largest impact on the posterior estimate of $\eta$, which in turn will impact the estimate of
the cancer rates for other cities. Cities 10 and 19, which have the highest MLE, also have the
highest posterior uncertainty, reflecting the fact that such a high estimate is in conflict with the
prior (which is estimated from all the other cities).

Figure 5.11 (a) Results of fitting the model using the data from (Johnson and Albert 1999, p24). First
row: Number of cancer incidents $x_i$ in 20 cities in Missouri. Second row: population size $N_i$. The largest
city (number 15) has a population of $N_{15} = 53637$ and $x_{15} = 54$ incidents, but we truncate the vertical
axes of the first two rows so that the differences between the other cities are visible. Third row: MLE $\hat{\theta}_i$.
The red line is the pooled MLE. Fourth row: posterior mean $E[\theta_i|D]$. The red line is $E[a/(a + b)|D]$,
the population-level mean. (b) Posterior 95% credible intervals on the cancer rates. Figure generated by
cancerRatesEb.

In the above example, we have one parameter per city, modeling the probability the response
is on. By making the Bernoulli rate parameter be a function of covariates, $\theta_i = \mathrm{sigm}(w_i^T x)$, we
can model multiple correlated logistic regression tasks. This is called multi-task learning, and
will be discussed in more detail in Section 9.5.


Empirical Bayes 


In hierarchical Bayesian models, we need to compute the posterior on multiple levels of latent 
variables. For example, in a two-level model, we need to compute 


$$p(\eta, \theta|D) \propto p(D|\theta)\, p(\theta|\eta)\, p(\eta) \tag{5.78}$$


In some cases, we can analytically marginalize out $\theta$; this leaves us with the simpler problem of
just computing $p(\eta|D)$.

As a computational shortcut, we can approximate the posterior on the hyper-parameters with
a point-estimate, $p(\eta|D) \approx \delta_{\hat{\eta}}(\eta)$, where $\hat{\eta} = \arg\max p(\eta|D)$. Since $\eta$ is typically much
smaller than $\theta$ in dimensionality, it is less prone to overfitting, so we can safely use a uniform
prior on $\eta$. Then the estimate becomes

$$\hat{\eta} = \arg\max p(D|\eta) = \arg\max \left[ \int p(D|\theta)\, p(\theta|\eta)\, d\theta \right] \tag{5.79}$$

where the quantity inside the brackets is the marginal or integrated likelihood, sometimes called
the evidence. This overall approach is called empirical Bayes (EB) or type-II maximum
likelihood. In machine learning, it is sometimes called the evidence procedure.

Empirical Bayes violates the principle that the prior should be chosen independently of the 
data. However, we can just view it as a computationally cheap approximation to inference in a 
hierarchical Bayesian model, just as we viewed MAP estimation as an approximation to inference 
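in the one level model $\theta \to D$. In fact, we can construct a hierarchy in which the more integrals
one performs, the “more Bayesian” one becomes:


Method                    Definition
Maximum likelihood        $\hat{\theta} = \arg\max_\theta p(D|\theta)$
MAP estimation            $\hat{\theta} = \arg\max_\theta p(D|\theta)\, p(\theta|\eta)$
ML-II (Empirical Bayes)   $\hat{\eta} = \arg\max_\eta \int p(D|\theta)\, p(\theta|\eta)\, d\theta = \arg\max_\eta p(D|\eta)$
MAP-II                    $\hat{\eta} = \arg\max_\eta \int p(D|\theta)\, p(\theta|\eta)\, p(\eta)\, d\theta = \arg\max_\eta p(D|\eta)\, p(\eta)$
Full Bayes                $p(\theta, \eta|D) \propto p(D|\theta)\, p(\theta|\eta)\, p(\eta)$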


Note that EB can be shown to have good frequentist properties (see e.g., (Carlin and Louis 
1996; Efron 2010)), so it is widely used by non-Bayesians. For example, the popular James-Stein 
estimator, discussed in Section 6.3.3.2, can be derived using EB. 


Example: beta-binomial model 


Let us return to the cancer rates model. We can analytically integrate out the $\theta_i$’s, and write
down the marginal likelihood directly, as follows:

$$p(D|a, b) = \prod_i \int \mathrm{Bin}(x_i|N_i, \theta_i)\, \mathrm{Beta}(\theta_i|a, b)\, d\theta_i \tag{5.80}$$
$$= \prod_i \binom{N_i}{x_i} \frac{B(a + x_i, b + N_i - x_i)}{B(a, b)} \tag{5.81}$$

The binomial coefficients do not depend on a and b, so they can be dropped when maximizing.
Various ways of maximizing this wrt a and b are discussed in (Minka 2000e).

Having estimated a and b, we can plug in the hyper-parameters to compute the posterior
$p(\theta_i|\hat{a}, \hat{b}, D)$ in the usual way, using conjugate analysis. The net result is that the posterior
mean of each $\theta_i$ is a weighted average of its local MLE and the prior mean, which depends on
$\eta = (a, b)$; but since $\eta$ is estimated based on all the data, each $\theta_i$ is influenced by all the data.
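
The whole empirical Bayes recipe for this model fits in a short sketch. Everything here is illustrative: the counts and populations are made up (they are not the Missouri data), the coarse grid search merely stands in for the optimizers discussed in (Minka 2000e), and the function names are not from the book's cancerRatesEb code.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_evidence(xs, ns, a, b):
    """Log of Equation 5.81, dropping binomial coefficients (constant in a and b)."""
    return sum(log_beta(a + x, b + n - x) - log_beta(a, b) for x, n in zip(xs, ns))


def fit_eb_grid(xs, ns, grid_a, grid_b):
    """Pick the (a, b) pair on a coarse grid that maximizes the evidence."""
    return max(((a, b) for a in grid_a for b in grid_b),
               key=lambda ab: log_evidence(xs, ns, ab[0], ab[1]))


def posterior_mean(x, n, a, b):
    """Posterior mean of theta_i as a weighted average of prior mean and local MLE."""
    lam = (a + b) / (a + b + n)            # weight on the prior mean a / (a + b)
    return lam * a / (a + b) + (1.0 - lam) * x / n


# Hypothetical incident counts and population sizes for five cities.
xs, ns = [0, 1, 2, 5, 0], [100, 400, 900, 2500, 60]
a_hat, b_hat = fit_eb_grid(xs, ns, [0.5, 1, 2, 4, 8], [250, 500, 1000, 2000, 4000])
means = [posterior_mean(x, n, a_hat, b_hat) for x, n in zip(xs, ns)]
```

The weight `lam` on the prior mean shrinks as the city's population grows, which is why small cities are pulled more strongly towards the population-level rate.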


Example: Gaussian-Gaussian model 


We now study another example that is analogous to the cancer rates example, except the data is 
real-valued. We will use a Gaussian likelihood and a Gaussian prior. This will allow us to write 
down the solution analytically. 

In particular, suppose we have data from multiple related groups. For example, $x_{ij}$ could be
the test score for student $i$ in school $j$, for $j = 1:D$ and $i = 1:N_j$. We want to estimate
the mean score for each school, $\theta_j$. However, since the sample size, $N_j$, may be small for
some schools, we can regularize the problem by using a hierarchical Bayesian model, where we
assume the $\theta_j$ come from a common prior, $\mathcal{N}(\mu, \tau^2)$.
The joint distribution has the following form:

$$p(\theta, \mathcal{D}|\eta, \sigma^2) = \prod_{j=1}^{D} \mathcal{N}(\theta_j|\mu, \tau^2) \prod_{i=1}^{N_j} \mathcal{N}(x_{ij}|\theta_j, \sigma^2) \tag{5.82}$$


where we assume $\sigma^2$ is known for simplicity. (We relax this assumption in Exercise 24.4.) We
explain how to estimate $\eta$ below. Once we have estimated $\eta = (\mu, \tau)$, we can compute the
posteriors over the $\theta_j$’s. To do that, it simplifies matters to rewrite the joint distribution in the
following form, exploiting the fact that $N_j$ Gaussian measurements with values $x_{ij}$ and variance
$\sigma^2$ are equivalent to one measurement of value $\bar{x}_j \triangleq \frac{1}{N_j} \sum_{i=1}^{N_j} x_{ij}$ with variance $\sigma_j^2 \triangleq \sigma^2 / N_j$.
This yields

$$p(\theta, \mathcal{D}|\hat{\eta}, \sigma^2) = \prod_{j=1}^{D} \mathcal{N}(\theta_j|\hat{\mu}, \hat{\tau}^2) \, \mathcal{N}(\bar{x}_j|\theta_j, \sigma_j^2) \tag{5.83}$$


From this, it follows from the results of Section 4.4.1 that the posteriors are given by 


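$$p(\theta_j|\mathcal{D}, \hat{\mu}, \hat{\tau}^2) = \mathcal{N}(\theta_j|\hat{B}_j \hat{\mu} + (1 - \hat{B}_j) \bar{x}_j, (1 - \hat{B}_j) \sigma_j^2) \tag{5.84}$$

$$\hat{B}_j \triangleq \frac{\sigma_j^2}{\sigma_j^2 + \hat{\tau}^2} \tag{5.85}$$

where $\hat{\mu} = \bar{x}$ and $\hat{\tau}^2$ will be defined below.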

The quantity $0 \leq \hat{B}_j \leq 1$ controls the degree of shrinkage towards the overall mean, $\mu$. If
the data is reliable for group $j$ (e.g., because the sample size $N_j$ is large), then $\sigma_j^2$ will be small
relative to $\tau^2$; hence $\hat{B}_j$ will be small, and we will put more weight on $\bar{x}_j$ when we estimate $\theta_j$.
However, groups with small sample sizes will get regularized (shrunk towards the overall mean
$\mu$) more heavily. We will see an example of this below.

If $\sigma_j = \sigma$ for all groups $j$, the posterior mean becomes

$$\hat{\theta}_j = \hat{B} \bar{x} + (1 - \hat{B}) \bar{x}_j = \bar{x} + (1 - \hat{B})(\bar{x}_j - \bar{x}) \tag{5.86}$$

This has exactly the same form as the James-Stein estimator discussed in Section 6.3.3.2.
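
To make the role of the shrinkage factor concrete, here is a minimal sketch of Equations 5.84 and 5.85 for given hyper-parameters. The function name `shrink` and the numbers are hypothetical; $\mu$, $\tau^2$ and $\sigma^2$ are simply assumed known here.

```python
def shrink(xbar, n, mu, tau2, sigma2):
    """Posterior mean, variance and B_j for each group (Equations 5.84-5.85)."""
    out = []
    for xbar_j, n_j in zip(xbar, n):
        sigma2_j = sigma2 / n_j              # variance of the group mean
        b_j = sigma2_j / (sigma2_j + tau2)   # shrinkage factor B_j
        mean = b_j * mu + (1 - b_j) * xbar_j
        var = (1 - b_j) * sigma2_j
        out.append((mean, var, b_j))
    return out

# Hypothetical school means and sample sizes.
xbar = [60.0, 75.0, 90.0]
n = [2, 20, 200]
result = shrink(xbar, n, mu=75.0, tau2=25.0, sigma2=100.0)
```

With these numbers the smallest school (2 students) has its mean of 60 pulled most of the way to the overall mean, while the largest school (200 students) is barely moved.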


Example: predicting baseball scores 


We now give an example of shrinkage applied to baseball batting averages, from (Efron and
Morris 1975). We observe the number of hits for $D = 18$ players during the first $T = 45$ games.
Call the number of hits $b_j$. We assume $b_j \sim \mathrm{Bin}(T, \theta_j)$, where $\theta_j$ is the “true” batting average
for player $j$. The goal is to estimate the $\theta_j$. The MLE is of course $\hat{\theta}_j = x_j$, where $x_j = b_j/T$ is
the empirical batting average. However, we can use an EB approach to do better.

To apply the Gaussian shrinkage approach described above, we require that the likelihood be
Gaussian, $x_j \sim \mathcal{N}(\theta_j, \sigma^2)$ for known $\sigma^2$. (We drop the $i$ subscript since we assume $N_j = 1$,
since $x_j$ already represents the average for player $j$.) However, in this example we have a
binomial likelihood. While this has the right mean, $\mathbb{E}[x_j] = \theta_j$, the variance is not constant:

$$\mathrm{var}[x_j] = \frac{1}{T^2} \mathrm{var}[b_j] = \frac{T \theta_j (1 - \theta_j)}{T^2} \tag{5.87}$$

So let us apply a variance stabilizing transform$^5$ to $x_j$ to better match the Gaussian assumption:

$$y_j = f(x_j) = \sqrt{T} \arcsin(2 x_j - 1) \tag{5.88}$$

Now we have approximately $y_j \sim \mathcal{N}(f(\theta_j), 1) = \mathcal{N}(\mu_j, 1)$. We use Gaussian shrinkage to
estimate the $\mu_j$ using Equation 5.86 with $\sigma^2 = 1$, and we then transform back to get

$$\hat{\theta}_j = 0.5 (\sin(\hat{\mu}_j / \sqrt{T}) + 1) \tag{5.89}$$

The results are shown in Figure 5.12(a-b). In (a), we plot the MLE $\hat{\theta}_j$ and the posterior mean $\bar{\theta}_j$.
We see that all the estimates have shrunk towards the global mean, 0.265. In (b), we plot the
true value $\theta_j$, the MLE $\hat{\theta}_j$ and the posterior mean $\bar{\theta}_j$. (The “true” values of $\theta_j$ are estimated
from a large number of independent games.) We see that, on average, the shrunken estimate
is much closer to the true parameters than the MLE is. Specifically, the mean squared error,
defined by $\mathrm{MSE} = \frac{1}{D} \sum_{j=1}^{D} (\theta_j - \bar{\theta}_j)^2$, is over three times smaller using the shrinkage estimates
$\bar{\theta}_j$ than using the MLEs $\hat{\theta}_j$.

Figure 5.12 (a) MLE parameters (top) and corresponding shrunken estimates (bottom). (b) We plot the
true parameters (blue), the posterior mean estimate (green), and the MLEs (red) for 5 of the players, indexed
by player number; the panel title reports MSE MLE = 0.0042, MSE shrunk = 0.0013. Figure
generated by shrinkageDemoBaseball.
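
The claim that the arcsine transform makes the variance approximately 1 can be checked numerically. The sketch below is only a sanity check under the first-order approximation of footnote 5, with a finite-difference derivative standing in for $f'$; the function names are assumptions, not part of shrinkageDemoBaseball.

```python
from math import asin, sin, sqrt

T = 45  # as in the example above

def forward(x):
    """y = sqrt(T) * arcsin(2x - 1), Equation 5.88."""
    return sqrt(T) * asin(2 * x - 1)

def backward(mu):
    """theta = 0.5 * (sin(mu / sqrt(T)) + 1), Equation 5.89."""
    return 0.5 * (sin(mu / sqrt(T)) + 1)

def delta_method_var(theta, eps=1e-6):
    """Approximate var[f(x)] as f'(theta)^2 * theta * (1 - theta) / T."""
    deriv = (forward(theta + eps) - forward(theta - eps)) / (2 * eps)
    return deriv ** 2 * theta * (1 - theta) / T

# The approximate variance should be close to 1 whatever the batting average.
approx_vars = [delta_method_var(th) for th in (0.2, 0.265, 0.4)]
round_trip = backward(forward(0.265))
```

Analytically, $f'(\theta)^2 = T / (\theta(1 - \theta))$, which cancels the binomial variance in Equation 5.87 exactly; the numerical values in `approx_vars` agree with this up to finite-difference error.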


Estimating the hyper-parameters 


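In this section, we give an algorithm for estimating $\eta$. Suppose initially that $\sigma_j^2 = \sigma^2$ is the
same for all groups. In this case, we can derive the EB estimate in closed form, as we now show.
From Equation 4.126, we have


$$p(\bar{x}_j|\mu, \tau^2, \sigma^2) = \int \mathcal{N}(\bar{x}_j|\theta_j, \sigma^2) \, \mathcal{N}(\theta_j|\mu, \tau^2) \, d\theta_j = \mathcal{N}(\bar{x}_j|\mu, \tau^2 + \sigma^2) \tag{5.90}$$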


5. Suppose $\mathbb{E}[X] = \mu$ and $\mathrm{var}[X] = \sigma^2(\mu)$. Let $Y = f(X)$. Then a Taylor series expansion gives $Y \approx
f(\mu) + (X - \mu) f'(\mu)$. Hence $\mathrm{var}[Y] \approx f'(\mu)^2 \mathrm{var}[X - \mu] = f'(\mu)^2 \sigma^2(\mu)$. A variance stabilizing transformation
is a function $f$ such that $f'(\mu)^2 \sigma^2(\mu)$ is independent of $\mu$.
Hence the marginal likelihood is

$$p(\mathcal{D}|\mu, \tau^2, \sigma^2) = \prod_{j=1}^{D} \mathcal{N}(\bar{x}_j|\mu, \tau^2 + \sigma^2) \tag{5.91}$$

Thus we can estimate the hyper-parameters using the usual MLEs for a Gaussian. For $\mu$, we
have

$$\hat{\mu} = \frac{1}{D} \sum_{j=1}^{D} \bar{x}_j = \bar{x} \tag{5.92}$$

which is the overall mean.
For the variance, we can use moment matching (which is equivalent to the MLE for a
Gaussian): we simply equate the model variance to the empirical variance:

$$\hat{\tau}^2 + \sigma^2 = \frac{1}{D} \sum_{j=1}^{D} (\bar{x}_j - \bar{x})^2 \triangleq s^2 \tag{5.93}$$

so $\hat{\tau}^2 = s^2 - \sigma^2$. Since we know $\tau^2$ must be non-negative, it is common to use the following revised
estimate:

$$\hat{\tau}^2 = \max\{0, s^2 - \sigma^2\} = (s^2 - \sigma^2)_+ \tag{5.94}$$

Hence the shrinkage factor is

$$\hat{B} = \frac{\sigma^2}{\sigma^2 + \hat{\tau}^2} = \frac{\sigma^2}{\sigma^2 + (s^2 - \sigma^2)_+} \tag{5.95}$$

In the case where the $\sigma_j^2$’s are different, we can no longer derive a solution in closed form.
Exercise 11.13 discusses how to use the EM algorithm to derive an EB estimate, and Exercise 24.4 
discusses how to perform full Bayesian inference in this hierarchical model. 
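
The closed-form estimates for the equal-variance case fit in a few lines. The following is a minimal sketch of Equations 5.92 to 5.95 followed by the shrinkage step of Equation 5.86; the function name and the input values are hypothetical, and the code assumes $\sigma^2 > 0$.

```python
def eb_shrinkage(xbar, sigma2):
    """Closed-form EB estimates for equal group variances (Eqs. 5.92-5.95)."""
    d = len(xbar)
    mu_hat = sum(xbar) / d                             # Equation 5.92
    s2 = sum((xj - mu_hat) ** 2 for xj in xbar) / d    # Equation 5.93
    tau2_hat = max(0.0, s2 - sigma2)                   # Equation 5.94
    b_hat = sigma2 / (sigma2 + tau2_hat)               # Equation 5.95
    # Equation 5.86: shrink every group mean towards the overall mean.
    theta_hat = [mu_hat + (1 - b_hat) * (xj - mu_hat) for xj in xbar]
    return mu_hat, tau2_hat, b_hat, theta_hat

# Hypothetical group means with known noise variance sigma2 = 1.
mu_hat, tau2_hat, b_hat, theta_hat = eb_shrinkage([-2.0, 0.0, 1.0, 3.0, 3.0], 1.0)

# When the spread of the group means is below the noise level, tau2_hat is
# clipped to 0, B = 1, and every estimate collapses onto the overall mean.
_, tau2_small, b_small, theta_small = eb_shrinkage([0.0, 0.1], 1.0)
```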


Bayesian decision theory 


We have seen how probability theory can be used to represent and update our beliefs about
the state of the world. However, ultimately our goal is to convert our beliefs into actions. In this 
section, we discuss the optimal way to do this. 

We can formalize any given statistical decision problem as a game against nature (as opposed 
to a game against other strategic players, which is the topic of game theory, see e.g., (Shoham 
and Leyton-Brown 2009) for details). In this game, nature picks a state or parameter or label, 
$y \in \mathcal{Y}$, unknown to us, and then generates an observation, $x \in \mathcal{X}$, which we get to see. We
then have to make a decision, that is, we have to choose an action $a$ from some action space
$\mathcal{A}$. Finally we incur some loss, $L(y, a)$, which measures how compatible our action $a$ is with
nature’s hidden state $y$. For example, we might use misclassification loss, $L(y, a) = \mathbb{I}(y \neq a)$,
or squared loss, $L(y, a) = (y - a)^2$. We will see some other examples below.
Our goal is to devise a decision procedure or policy, $\delta : \mathcal{X} \rightarrow \mathcal{A}$, which specifies the
optimal action for each possible input. By optimal, we mean the action that minimizes the
expected loss:


$$\delta(x) = \arg\min_{a \in \mathcal{A}} \mathbb{E}[L(y, a)] \tag{5.96}$$


In economics, it is more common to talk of a utility function; this is just negative loss,
$U(y, a) = -L(y, a)$. Thus the above rule becomes


$$\delta(x) = \arg\max_{a \in \mathcal{A}} \mathbb{E}[U(y, a)] \tag{5.97}$$


This is called the maximum expected utility principle, and is the essence of what we mean 
by rational behavior. 

Note that there are two different interpretations of what we mean by “expected”. In the 
Bayesian version, which we discuss below, we mean the expected value of y given the data we 
have seen so far. In the frequentist version, which we discuss in Section 6.3, we mean the 
expected value of y and x that we expect to see in the future. 

In the Bayesian approach to decision theory, the optimal action, having observed x, is defined 
as the action a that minimizes the posterior expected loss: 


$$\rho(a|x) \triangleq \mathbb{E}_{p(y|x)}[L(y, a)] = \sum_y L(y, a) p(y|x) \tag{5.98}$$


(If $y$ is continuous (e.g., when we want to estimate a parameter vector), we should replace the
sum with an integral.) Hence the Bayes estimator, also called the Bayes decision rule, is given
by


$$\delta(x) = \arg\min_{a \in \mathcal{A}} \rho(a|x) \tag{5.99}$$
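
For a discrete state and action space, Equations 5.98 and 5.99 amount to a weighted sum followed by an argmin. The sketch below illustrates this; the nested-dictionary representation `loss[y][a]`, the function names and the numbers are assumptions chosen for illustration.

```python
def posterior_expected_loss(loss, posterior, action):
    """rho(a|x) = sum_y L(y, a) p(y|x), Equation 5.98."""
    return sum(loss[y][action] * p for y, p in posterior.items())

def bayes_action(loss, posterior, actions):
    """delta(x) = argmin_a rho(a|x), Equation 5.99."""
    return min(actions, key=lambda a: posterior_expected_loss(loss, posterior, a))

# Hypothetical two-state problem; loss[y][a] is the loss of action a in state y.
zero_one = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}
asymmetric = {0: {0: 0, 1: 1}, 1: {0: 5, 1: 0}}  # missing y = 1 costs 5x more
posterior = {0: 0.7, 1: 0.3}

a_zero_one = bayes_action(zero_one, posterior, [0, 1])      # picks the mode, 0
a_asymmetric = bayes_action(asymmetric, posterior, [0, 1])  # picks 1 instead
```

The same posterior leads to different actions under the two loss functions, which is the point of separating beliefs from losses.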


Bayes estimators for common loss functions 


In this section we show how to construct Bayes estimators for the loss functions most commonly 
arising in machine learning. 


MAP estimate minimizes 0-1 loss 
The 0-1 loss is defined by 


$$L(y, a) = \mathbb{I}(y \neq a) = \begin{cases} 0 & \text{if } a = y \\ 1 & \text{if } a \neq y \end{cases} \tag{5.100}$$


This is commonly used in classification problems where $y$ is the true class label and $a = \hat{y}$ is
the estimate.
For example, in the two class case, we can write the loss matrix as follows:

           $\hat{y} = 1$   $\hat{y} = 0$
$y = 1$         0               1
$y = 0$         1               0

Figure 5.13 For some regions of input space, where the class posteriors are uncertain, we may prefer not
to choose class 1 or 2; instead we may prefer the reject option. Based on Figure 1.26 of (Bishop 2006a).


(In Section 5.7.2, we generalize this loss function so it penalizes the two kinds of errors on 
the off-diagonal differently.) 
The posterior expected loss is 


$$\rho(a|x) = p(a \neq y|x) = 1 - p(y = a|x) \tag{5.101}$$

Hence the action that minimizes the expected loss is the posterior mode or MAP estimate


$$y^*(x) = \arg\max_{y \in \mathcal{Y}} p(y|x) \tag{5.102}$$


Reject option 


In classification problems where p(y|x) is very uncertain, we may prefer to choose a reject 
action, in which we refuse to classify the example as any of the specified classes, and instead 
say “don’t know”. Such ambiguous cases can be handled by e.g., a human expert. See Figure 5.13 
for an illustration. This is useful in risk averse domains such as medicine and finance. 

We can formalize the reject option as follows. Let choosing $a = C + 1$ correspond to
picking the reject action, and choosing $a \in \{1, \ldots, C\}$ correspond to picking one of the classes.
Suppose we define the loss function as


$$L(y = j, a = i) = \begin{cases} 0 & \text{if } i = j \text{ and } i, j \in \{1, \ldots, C\} \\ \lambda_r & \text{if } i = C + 1 \\ \lambda_s & \text{otherwise} \end{cases} \tag{5.103}$$


where $\lambda_r$ is the cost of the reject action, and $\lambda_s$ is the cost of a substitution error. In Exercise 5.3,
you will show that the optimal action is to pick the reject action if the most probable class has
a probability below $1 - \lambda_r / \lambda_s$; otherwise you should just pick the most probable class.
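
The resulting decision rule is short enough to state as code. This is a sketch under the loss of Equation 5.103; the function name, the string `"reject"` used to denote the reject action, and the example costs are illustrative assumptions.

```python
def classify_with_reject(posterior, lambda_r, lambda_s):
    """Return the most probable class, or "reject" if it is too uncertain.

    posterior: dict mapping class label -> p(y = label | x).
    Rejects when max_y p(y|x) < 1 - lambda_r / lambda_s.
    """
    best = max(posterior, key=posterior.get)
    if posterior[best] < 1 - lambda_r / lambda_s:
        return "reject"
    return best

# Hypothetical costs: a substitution error costs 10, a reject costs 3,
# so the classifier abstains whenever its top probability is below 0.7.
confident = classify_with_reject({0: 0.8, 1: 0.2}, lambda_r=3, lambda_s=10)
uncertain = classify_with_reject({0: 0.6, 1: 0.4}, lambda_r=3, lambda_s=10)
```

In the second call, predicting class 0 has expected loss $10 \times 0.4 = 4$, which exceeds the reject cost of 3, so rejecting is the cheaper action.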
Figure 5.14 (a-c). Plots of $L(y, a) = |y - a|^q$ vs $|y - a|$ for $q = 0.2$, $q = 1$ and $q = 2$. Figure
generated by lossFunctionFig.


Posterior mean minimizes $\ell_2$ (quadratic) loss


For continuous parameters, a more appropriate loss function is squared error, $\ell_2$ loss, or
quadratic loss, defined as


$$L(y, a) = (y - a)^2 \tag{5.104}$$

The posterior expected loss is given by

$$\rho(a|x) = \mathbb{E}[(y - a)^2|x] = \mathbb{E}[y^2|x] - 2a \mathbb{E}[y|x] + a^2 \tag{5.105}$$

Hence the optimal estimate is the posterior mean:

$$\frac{\partial}{\partial a} \rho(a|x) = -2 \mathbb{E}[y|x] + 2a = 0 \;\Rightarrow\; \hat{y} = \mathbb{E}[y|x] = \int y \, p(y|x) \, dy \tag{5.106}$$


This is often called the minimum mean squared error estimate or MMSE estimate. 
In a linear regression problem, we have 


$$p(y|x, \theta) = \mathcal{N}(y|x^T w, \sigma^2) \tag{5.107}$$


In this case, the optimal estimate given some training data $\mathcal{D}$ is given by


$$\mathbb{E}[y|x, \mathcal{D}] = x^T \mathbb{E}[w|\mathcal{D}] \tag{5.108}$$


That is, we just plug in the posterior mean parameter estimate. Note that this is the optimal
thing to do no matter what prior we use for $w$.


Posterior median minimizes $\ell_1$ (absolute) loss


The $\ell_2$ loss penalizes deviations from the truth quadratically, and thus is sensitive to outliers. A
more robust alternative is the absolute or $\ell_1$ loss, $L(y, a) = |y - a|$ (see Figure 5.14). The optimal
estimate is the posterior median, i.e., a value $a$ such that $P(y < a|x) = P(y \geq a|x) = 0.5$.
See Exercise 5.9 for a proof. 
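
Both results can be checked numerically by replacing the posterior with a handful of samples and searching over candidate actions. The sketch below does this by brute force; the samples (which include one outlier) and the grid of candidate actions are hypothetical, and this is an illustration rather than a proof.

```python
from statistics import mean, median

def expected_loss(samples, a, q):
    """Monte Carlo estimate of E[|y - a|^q | x] from posterior samples."""
    return sum(abs(y - a) ** q for y in samples) / len(samples)

# Hypothetical posterior samples with one outlier.
samples = [1.0, 2.0, 2.0, 3.0, 30.0]
candidates = [k / 100 for k in range(0, 3001)]  # grid of actions in [0, 30]

best_l2 = min(candidates, key=lambda a: expected_loss(samples, a, 2))
best_l1 = min(candidates, key=lambda a: expected_loss(samples, a, 1))
post_mean = mean(samples)      # dragged upwards by the outlier
post_median = median(samples)  # unaffected by the outlier
```

The minimizer of the squared loss coincides with `post_mean` and the minimizer of the absolute loss with `post_median`, and only the former is pulled towards the outlier.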


Supervised learning 


Consider a prediction function $\delta : \mathcal{X} \rightarrow \mathcal{Y}$, and suppose we have some cost function $L(y, y')$
which gives the cost of predicting $y'$ when the truth is $y$. We can define the loss incurred by
taking action $\delta$ (i.e., using this predictor) when the unknown state of nature is $\theta$ (the parameters
of the data generating mechanism) as follows:


$$L(\theta, \delta) \triangleq \mathbb{E}_{(x, y) \sim p(x, y|\theta)}[L(y, \delta(x))] = \sum_x \sum_y L(y, \delta(x)) p(x, y|\theta) \tag{5.109}$$


This is known as the generalization error. Our goal is to minimize the posterior expected loss,
given by


$$\rho(\delta|\mathcal{D}) = \int p(\theta|\mathcal{D}) L(\theta, \delta) \, d\theta \tag{5.110}$$
This should be contrasted with the frequentist risk which is defined in Equation 6.47. 


The false positive vs false negative tradeoff 


In this section, we focus on binary decision problems, such as hypothesis testing, two-class 
classification, object/event detection, etc. There are two types of error we can make: a false
positive (aka false alarm), which arises when we estimate $\hat{y} = 1$ but the truth is $y = 0$; or a
false negative (aka missed detection), which arises when we estimate $\hat{y} = 0$ but the truth is
$y = 1$. The 0-1 loss treats these two kinds of errors equivalently. However, we can consider the
following more general loss matrix:

           $\hat{y} = 1$   $\hat{y} = 0$
$y = 1$         0            $L_{FN}$
$y = 0$      $L_{FP}$           0

where $L_{FN}$ is the cost of a false negative, and $L_{FP}$ is the cost of a false positive. The
posterior expected loss for the two possible actions is given by


$$\rho(\hat{y} = 0|x) = L_{FN} \, p(y = 1|x) \tag{5.111}$$
$$\rho(\hat{y} = 1|x) = L_{FP} \, p(y = 0|x) \tag{5.112}$$


Hence we should pick class $\hat{y} = 1$ iff


$$\rho(\hat{y} = 0|x) > \rho(\hat{y} = 1|x) \tag{5.113}$$
$$\frac{p(y = 1|x)}{p(y = 0|x)} > \frac{L_{FP}}{L_{FN}} \tag{5.114}$$


If $L_{FP} = c L_{FN}$, it is easy to show (Exercise 5.10) that we should pick $\hat{y} = 1$ iff
$p(y = 1|x) > \tau$, where $\tau = c/(1 + c)$ (see also (Muller et al. 2004)). For example, if a
false positive costs twice as much as a false negative, so $c = 2$, then we use a decision threshold
of 2/3 before declaring a positive.
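
Equation 5.114 can be rearranged into a threshold on $p(y = 1|x)$ directly: writing $p_1 = p(y = 1|x)$, the condition $p_1/(1 - p_1) > L_{FP}/L_{FN}$ is equivalent to $p_1 > L_{FP}/(L_{FP} + L_{FN})$. The sketch below encodes this; the function names are hypothetical.

```python
def positive_threshold(l_fp, l_fn):
    """Threshold on p(y=1|x) implied by Equation 5.114.

    Rearranging p1 / (1 - p1) > L_FP / L_FN gives p1 > L_FP / (L_FP + L_FN).
    """
    return l_fp / (l_fp + l_fn)

def decide(p1, l_fp, l_fn):
    """Return 1 if declaring a positive has the lower posterior expected loss."""
    return int(l_fn * p1 > l_fp * (1 - p1))

equal_costs = positive_threshold(1, 1)   # 0.5, the usual 0-1 loss rule
costly_fp = positive_threshold(2, 1)     # 2/3: be cautious before saying "1"
costly_fn = positive_threshold(1, 2)     # 1/3: be quick to say "1"
```

Making false positives more expensive raises the threshold, and making false negatives more expensive lowers it.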

Below we discuss ROC curves, which provide a way to study the FP-FN tradeoff without having 
to choose a specific threshold. 


ROC curves and all that 


Suppose we are solving a binary decision problem, such as classification, hypothesis testing,
object detection, etc. Also, assume we have a labeled data set, $\mathcal{D} = \{(x_i, y_i)\}$. Let
$\delta(x) = \mathbb{I}(f(x) > \tau)$ be our decision rule, where $f(x)$ is a measure of confidence that $y = 1$ (this
should be monotonically related to $p(y = 1|x)$, but does not need to be a probability), and $\tau$ is
some threshold parameter. For each given value of $\tau$, we can apply our decision rule and count
the number of true positives, false positives, true negatives, and false negatives that occur, as
shown in Table 5.2. This table of errors is called a confusion matrix.


                          Truth
                   1                  0                  $\Sigma$
Estimate    1      TP                 FP                 $\hat{N}_+ = TP + FP$
            0      FN                 TN                 $\hat{N}_- = FN + TN$
     $\Sigma$      $N_+ = TP + FN$    $N_- = FP + TN$    $N = TP + FP + FN + TN$


Table 5.2 Quantities derivable from a confusion matrix. $N_+$ is the true number of positives, $\hat{N}_+$ is the
“called” number of positives, $N_-$ is the true number of negatives, $\hat{N}_-$ is the “called” number of negatives.


                 $y = 1$                                        $y = 0$
$\hat{y} = 1$    $TP/N_+$ = TPR = sensitivity = recall          $FP/N_-$ = FPR = type I
$\hat{y} = 0$    $FN/N_+$ = FNR = miss rate = type II           $TN/N_-$ = TNR = specificity


Table 5.3 Estimating $p(\hat{y}|y)$ from a confusion matrix. Abbreviations: FNR = false negative rate, FPR =
false positive rate, TNR = true negative rate, TPR = true positive rate.

From this table, we can compute the true positive rate (TPR), also known as the sensitivity,
recall or hit rate, by using $TPR = TP/N_+ \approx p(\hat{y} = 1|y = 1)$. We can also compute the
false positive rate (FPR), also called the false alarm rate, or the type I error rate, by using
$FPR = FP/N_- \approx p(\hat{y} = 1|y = 0)$. These and other definitions are summarized in Tables 5.3
and 5.4. We can combine these errors in any way we choose to compute a loss function.

However, rather than computing the TPR and FPR for a fixed threshold $\tau$, we can run
our detector for a set of thresholds, and then plot the TPR vs FPR as an implicit function of
$\tau$. This is called a receiver operating characteristic or ROC curve. See Figure 5.15(a) for an
example. Any system can achieve the point on the bottom left, $(FPR = 0, TPR = 0)$, by
setting $\tau = 1$ and thus classifying everything as negative; similarly any system can achieve the
point on the top right, $(FPR = 1, TPR = 1)$, by setting $\tau = 0$ and thus classifying everything
as positive. If a system is performing at chance level, then we can achieve any point on the
diagonal line $TPR = FPR$ by choosing an appropriate threshold. A system that perfectly
separates the positives from negatives has a threshold that can achieve the top left corner,
$(FPR = 0, TPR = 1)$; by varying the threshold such a system will “hug” the left axis and
then the top axis, as shown in Figure 5.15(a).

The quality of a ROC curve is often summarized as a single number using the area under the 
curve or AUC. Higher AUC scores are better; the maximum is obviously 1. Another summary 
statistic that is used is the equal error rate or EER, also called the cross over rate, defined 
as the value which satisfies FPR = FNR. Since FNR = 1 — TPR, we can compute the 
EER by drawing a line from the top left to the bottom right and seeing where it intersects the 
ROC curve (see points A and B in Figure 5.15(a)). Lower EER scores are better; the minimum is 
obviously 0. 
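
The construction of a ROC curve and its two summary statistics can be sketched in plain Python. The code below sweeps the threshold over the observed scores (calling an item positive when its score is at least the threshold), integrates with the trapezoidal rule, and approximates the EER by the operating point where FPR and FNR are closest. The function names and the toy scores are assumptions; this is not the code behind Figure 5.15.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    for tau in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def equal_error_rate(points):
    """Approximate EER: the operating point where FPR is closest to FNR = 1 - TPR."""
    fpr, tpr = min(points, key=lambda p: abs(p[0] - (1 - p[1])))
    return (fpr + (1 - tpr)) / 2

# Hypothetical confidence scores f(x) and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)               # 8 of the 9 positive/negative pairs are ordered correctly
eer = equal_error_rate(points)
```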
Figure 5.15 (a) ROC curves for two hypothetical classification systems. A is better than B. We plot the
true positive rate (TPR) vs the false positive rate (FPR) as we vary the threshold $\tau$. We also indicate the
equal error rate (EER) with the red and blue dots, and the area under the curve (AUC) for classifier B. (b)
A precision-recall curve for two hypothetical classification systems. A is better than B. Figure generated by
PRhand.


                 $y = 1$                               $y = 0$
$\hat{y} = 1$    $TP/\hat{N}_+$ = precision = PPV      $FP/\hat{N}_+$ = FDP
$\hat{y} = 0$    $FN/\hat{N}_-$                        $TN/\hat{N}_-$ = NPV


Table 5.4 Estimating $p(y|\hat{y})$ from a confusion matrix. Abbreviations: FDP = false discovery probability,
NPV = negative predictive value, PPV = positive predictive value.


Precision recall curves 


When trying to detect a rare event (such as retrieving a relevant document or finding a face 
in an image), the number of negatives is very large. Hence comparing $TPR = TP/N_+$ to
$FPR = FP/N_-$ is not very informative, since the FPR will be very small. Hence all the
“action” in the ROC curve will occur on the extreme left. In such cases, it is common to plot 
the TPR versus the number of false positives, rather than vs the false positive rate. 

However, in some cases, the very notion of “negative” is not well-defined. For example, when 
detecting objects in images (see Section 1.2.1.3), if the detector works by classifying patches, then 
the number of patches examined — and hence the number of true negatives — is a parameter 
of the algorithm, not part of the problem definition. So we would like to use a measure that 
only talks about positives. 

The precision is defined as $TP/\hat{N}_+ = p(y = 1|\hat{y} = 1)$ and the recall is defined as
$TP/N_+ = p(\hat{y} = 1|y = 1)$. Precision measures what fraction of our detections are actually
positive, and recall measures what fraction of the positives we actually detected. If $\hat{y}_i \in \{0, 1\}$
is the predicted label, and $y_i \in \{0, 1\}$ is the true label, we can estimate precision and recall
using

$$P = \frac{\sum_i y_i \hat{y}_i}{\sum_i \hat{y}_i}, \quad R = \frac{\sum_i y_i \hat{y}_i}{\sum_i y_i} \tag{5.115}$$

A precision recall curve is a plot of precision vs recall as we vary the threshold $\tau$. See
Figure 5.15(b). Hugging the top right is the best one can do.
This curve can be summarized as a single number using the mean precision (averaging over
                 Class 1                              Class 2                              Pooled
                 $y = 1$   $y = 0$                    $y = 1$   $y = 0$                    $y = 1$   $y = 0$
$\hat{y} = 1$      10        10      $\hat{y} = 1$      90        10      $\hat{y} = 1$     100        20
$\hat{y} = 0$      10       970      $\hat{y} = 0$      10       890      $\hat{y} = 0$      20      1860


Table 5.5 Illustration of the difference between macro- and micro-averaging. $y$ is the true label, and $\hat{y}$
is the called label. In this example, the macro-averaged precision is $[10/(10 + 10) + 90/(10 + 90)]/2 =
(0.5 + 0.9)/2 = 0.7$. The micro-averaged precision is $100/(100 + 20) \approx 0.83$. Based on Table 13.7 of
(Manning et al. 2008).


recall values), which approximates the area under the curve. Alternatively, one can quote the 
precision for a fixed recall level, such as the precision of the first K = 10 entities recalled. 
This is called the average precision at K score. This measure is widely used when evaluating 
information retrieval systems. 
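
The estimators in Equation 5.115 and the precision of the top $K$ results are straightforward to compute from labels and scores. The sketch below uses hypothetical function names and data, and assumes that at least one item is called positive and at least one item is truly positive, so that neither denominator is zero.

```python
def precision_recall(y_true, y_pred):
    """Estimate precision and recall from 0/1 labels (Equation 5.115)."""
    tp = sum(y * yh for y, yh in zip(y_true, y_pred))
    called = sum(y_pred)   # number of items called positive
    actual = sum(y_true)   # number of items that are truly positive
    return tp / called, tp / actual

def precision_at_k(scores, y_true, k):
    """Precision among the k highest-scoring items."""
    ranked = sorted(zip(scores, y_true), key=lambda t: t[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k

# Hypothetical retrieval results.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
p, r = precision_recall(y_true, y_pred)
p_at_2 = precision_at_k(scores, y_true, 2)
```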


F-scores * 


For a fixed threshold, one can compute a single precision and recall value. These are often
combined into a single statistic called the F score, or F1 score, which is the harmonic mean of
precision and recall:


$$F_1 \triangleq \frac{2}{1/P + 1/R} = \frac{2PR}{R + P} \tag{5.116}$$

Using Equation 5.115, we can write this as

$$F_1 = \frac{2 \sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i} \tag{5.117}$$

This is a widely used measure in information retrieval systems.
To understand why we use the harmonic mean instead of the arithmetic mean, $(P + R)/2$,
consider the following scenario. Suppose we recall all entries, so $R = 1$. The precision will be
given by the prevalence, $p(y = 1)$. Suppose the prevalence is low, say $p(y = 1) = 10^{-4}$. The
arithmetic mean of $P$ and $R$ is given by $(P + R)/2 = (10^{-4} + 1)/2 \approx 50\%$. By contrast, the
harmonic mean of this strategy is only $\frac{2 \times 10^{-4} \times 1}{1 + 10^{-4}} \approx 0.02\%$.
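
The contrast between the two means in this scenario is easy to reproduce. The following is a small sketch of Equation 5.116 applied to the "recall everything" strategy; the function name is hypothetical, and returning 0 when both precision and recall are 0 is a convention assumed here.

```python
def f1_score(p, r):
    """Harmonic mean of precision p and recall r (Equation 5.116)."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

p, r = 1e-4, 1.0            # recall everything when the prevalence is 1e-4
arithmetic = (p + r) / 2    # about 0.5
harmonic = f1_score(p, r)   # about 2e-4
```

The harmonic mean is dominated by the smaller of the two values, so a degenerate strategy with near-zero precision cannot score well.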

In the multi-class case (e.g., for document classification problems), there are two ways to
generalize $F_1$ scores. The first is called macro-averaged F1, and is defined as $\sum_{c=1}^{C} F_1(c)/C$,
where $F_1(c)$ is the $F_1$ score obtained on the task of distinguishing class $c$ from all the others.
The other is called micro-averaged F1, and is defined as the $F_1$ score where we pool all the
counts from each class’s contingency table.

Table 5.5 gives a worked example that illustrates the difference. We see that the precision of 
class 1 is 0.5, and of class 2 is 0.9. The macro-averaged precision is therefore 0.7, whereas the 
micro-averaged precision is 0.83. The latter is much closer to the precision of class 2 than to 
the precision of class 1, since class 2 is five times larger than class 1. To give equal weight to 
each class, use macro-averaging. 
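
The two averages quoted for Table 5.5 can be reproduced from the true positive and false positive counts alone. The sketch below does so for precision; the dictionary layout and variable names are illustrative assumptions.

```python
def precision(tp, fp):
    """Fraction of called positives that are true positives."""
    return tp / (tp + fp)

# (TP, FP) counts for each class, read off Table 5.5.
counts = {"class 1": (10, 10), "class 2": (90, 10)}

# Macro-averaging: compute the precision per class, then average.
macro = sum(precision(tp, fp) for tp, fp in counts.values()) / len(counts)

# Micro-averaging: pool the counts first, then compute one precision.
pooled_tp = sum(tp for tp, _ in counts.values())
pooled_fp = sum(fp for _, fp in counts.values())
micro = precision(pooled_tp, pooled_fp)
```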
False discovery rates *


Suppose we are trying to discover a rare phenomenon using some kind of high-throughput
measurement device, such as a gene expression microarray, or a radio telescope. We will need
to make many binary decisions of the form $p(y_i = 1|\mathcal{D}) > \tau$, where $\mathcal{D} = \{x_i\}_{i=1}^{N}$ and $N$ may
be large. This is called multiple hypothesis testing. Note that the difference from standard
binary classification is that we are classifying $y_i$ based on all the data, not just based on $x_i$. So
this is a simultaneous classification problem, where we might hope to do better than a series of
individual classification problems.

How should we set the threshold $\tau$? A natural approach is to try to minimize the expected
number of false positives. In the Bayesian approach, this can be computed as follows:


$$FD(\tau, \mathcal{D}) \triangleq \sum_i \underbrace{(1 - p_i)}_{\text{pr. error}} \underbrace{\mathbb{I}(p_i > \tau)}_{\text{discovery}} \tag{5.118}$$


where $p_i = p(y_i = 1|\mathcal{D})$ is your belief that this object exhibits the phenomenon in question.
We then define the posterior expected false discovery rate as follows:


$$FDR(\tau, \mathcal{D}) \triangleq FD(\tau, \mathcal{D}) / N(\tau, \mathcal{D}) \tag{5.119}$$


where $N(\tau, \mathcal{D}) = \sum_i \mathbb{I}(p_i > \tau)$ is the number of discovered items. Given a desired FDR
tolerance, say $\alpha = 0.05$, one can then adapt $\tau$ to achieve this; this is called the direct posterior
probability approach to controlling the FDR (Newton et al. 2004; Muller et al. 2004).

In order to control the FDR it is very helpful to estimate the $p_i$’s jointly (e.g., using a hierarchical
Bayesian model, as in Section 5.5), rather than independently. This allows the pooling of
statistical strength, and thus lower FDR. See e.g., (Berry and Hochberg 1999) for more information.
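
One way to picture the direct posterior probability approach is as follows: rank the items by $p_i$ and keep adding discoveries while the running average of $1 - p_i$ stays within the tolerance. The sketch below implements Equations 5.118 and 5.119 on that basis; the function names, the convention that an empty discovery set has FDR 0, and the posterior probabilities are assumptions for illustration.

```python
def expected_fdr(p, tau):
    """Posterior expected FDR at threshold tau (Equations 5.118-5.119)."""
    discovered = [pi for pi in p if pi > tau]
    if not discovered:
        return 0.0  # convention assumed here: no discoveries, no false ones
    return sum(1 - pi for pi in discovered) / len(discovered)

def discoveries_at_fdr(p, alpha):
    """Indices of the largest top-ranked set whose expected FDR is at most alpha."""
    ranked = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    chosen, expected_false = [], 0.0
    for i in ranked:
        # The running mean of (1 - p_i) never decreases as lower-ranked items
        # are added, so we can stop at the first violation.
        if (expected_false + 1 - p[i]) / (len(chosen) + 1) > alpha:
            break
        expected_false += 1 - p[i]
        chosen.append(i)
    return chosen

# Hypothetical posterior probabilities p_i = p(y_i = 1 | D).
p = [0.99, 0.98, 0.95, 0.9, 0.6, 0.2]
chosen = discoveries_at_fdr(p, alpha=0.05)
```

With these values the first four items are declared discoveries (expected FDR 0.045); admitting the fifth would raise the expected FDR to 0.116.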


Other topics * 


In this section, we briefly mention a few other topics related to Bayesian decision theory. We do 
not have space to go into detail, but we include pointers to the relevant literature. 


Contextual bandits 


A one-armed bandit is a colloquial term for a slot machine, found in casinos around the world. 
The game is this: you insert some money, pull an arm, and wait for the machine to stop; if 
you're lucky, you win some money. Now imagine there is a bank of K such machines to choose 
from. Which one should you use? This is called a multi-armed bandit, and can be modeled 
using Bayesian decision theory: there are K possible actions, and each action has an unknown 
reward (payoff function) $r_k$. By maintaining a belief state, $p(r_{1:K}|\mathcal{D}) = \prod_k p(r_k|\mathcal{D})$, one can
devise an optimal policy; this can be compiled into a series of Gittins Indices (Gittins 1989). 
This optimally solves the exploration-exploitation tradeoff, which specifies how many times 
one should try each action before deciding to go with the winner. 

Now consider an extension where each arm, and the player, has an associated feature vector; 
call all these features $x$. This is called a contextual bandit (see e.g., (Sarkar 1991; Scott 2010;
Li et al. 2011)). For example, the “arms” could represent ads or news articles which we want
to show to the user, and the features could represent properties of these ads or articles, such
as a bag of words, as well as properties of the user, such as demographics. If we assume a
linear model for reward, $r_k = \theta_k^T x$, we can maintain a distribution over the parameters of each
arm, $p(\theta_k|\mathcal{D})$, where $\mathcal{D}$ is a series of tuples of the form $(a, x, r)$, which specifies which arm
was pulled, what its features were, and what the resulting outcome was (e.g., $r = 1$ if the user
clicked on the ad, and $r = 0$ otherwise). We discuss ways to compute $p(\theta_k|\mathcal{D})$ from linear and
logistic regression models in later chapters.

Given the posterior, we must decide what action to take. One common heuristic, known as
UCB (which stands for “upper confidence bound”) is to take the action


$$k^* = \arg\max_{k=1}^{K} \; \mu_k + \lambda \sigma_k \tag{5.120}$$

where $\mu_k = \mathbb{E}[r_k|\mathcal{D}]$, $\sigma_k^2 = \mathrm{var}[r_k|\mathcal{D}]$ and $\lambda$ is a tuning parameter that trades off exploration
and exploitation. The intuition is that we should pick actions which we believe are good
($\mu_k$ is large), and/or actions about which we are uncertain ($\sigma_k$ is large).

An even simpler method, known as Thompson sampling, is as follows. At each step, we pick
action $k$ with a probability that is equal to its probability of being the optimal action:


$$p_k = \int \mathbb{I}\left(\mathbb{E}[r|a = k, x, \theta] = \max_{a'} \mathbb{E}[r|a', x, \theta]\right) p(\theta|\mathcal{D}) \, d\theta \tag{5.121}$$

We can approximate this by drawing a single sample from the posterior, $\theta^t \sim p(\theta|\mathcal{D})$, and then
choosing $k^* = \arg\max_k \mathbb{E}[r|x, k, \theta^t]$. Despite its simplicity, this has been shown to work quite
well (Chapelle and Li 2011).
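
Both heuristics are easy to sketch in the simplest setting. The code below drops the context $x$ and assumes each arm pays a Bernoulli reward with a Beta belief over its success rate; this simplification, the function names and the simulated reward rates are all assumptions made for illustration, not the linear contextual model described above.

```python
import random

def beta_mean_std(a, b):
    """Mean and standard deviation of a Beta(a, b) belief over an arm's reward rate."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var ** 0.5

def ucb_choice(beliefs, lam):
    """Pick argmax_k mu_k + lam * sigma_k (Equation 5.120)."""
    def score(k):
        mu, sigma = beta_mean_std(*beliefs[k])
        return mu + lam * sigma
    return max(range(len(beliefs)), key=score)

def thompson_choice(beliefs, rng):
    """Draw one sample per arm from its posterior and play the best sample."""
    draws = [rng.betavariate(a, b) for a, b in beliefs]
    return max(range(len(draws)), key=draws.__getitem__)

def run_thompson(true_rates, steps, seed=0):
    """Simulate Thompson sampling on Bernoulli arms with Beta(1, 1) priors."""
    rng = random.Random(seed)
    beliefs = [[1, 1] for _ in true_rates]
    pulls = [0] * len(true_rates)
    for _ in range(steps):
        k = thompson_choice(beliefs, rng)
        reward = rng.random() < true_rates[k]
        beliefs[k][0 if reward else 1] += 1   # conjugate Beta update
        pulls[k] += 1
    return pulls, beliefs

pulls, beliefs = run_thompson([0.1, 0.5, 0.9], steps=2000)
```

In this simulation most pulls end up on the arm with the highest true reward rate, while the other arms are only tried often enough to rule them out. With `lam = 0`, `ucb_choice` is purely greedy; larger values of `lam` favor arms whose beliefs are still wide.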


Utility theory 


Suppose we are a doctor trying to decide whether to operate on a patient or not. We imagine 
there are 3 states of nature: the patient has no cancer, the patient has lung cancer, or the 
patient has breast cancer. Since the action and state space is discrete, we can represent the loss
function $L(\theta, a)$ as a loss matrix, such as the following:

                 Surgery   No surgery
No cancer           20          0
Lung cancer         10         50
Breast cancer       10         60


These numbers reflect the fact that not performing surgery when the patient has cancer is
very bad (loss of 50 or 60, depending on the type of cancer), since the patient might die; not 
performing surgery when the patient does not have cancer incurs no loss (0); performing surgery 
when the patient does not have cancer is wasteful (loss of 20); and performing surgery when 
the patient does have cancer is painful but necessary (10). 
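
To see how this loss matrix turns beliefs into a decision, consider a purely hypothetical posterior $p(\text{no cancer}|x) = 0.7$, $p(\text{lung}|x) = 0.2$, $p(\text{breast}|x) = 0.1$ (these probabilities are invented for illustration). The posterior expected losses of the two actions are then:

```latex
\begin{align*}
\rho(\text{surgery}|x) &= 0.7 \cdot 20 + 0.2 \cdot 10 + 0.1 \cdot 10 = 17 \\
\rho(\text{no surgery}|x) &= 0.7 \cdot 0 + 0.2 \cdot 50 + 0.1 \cdot 60 = 16
\end{align*}
```

So under this posterior, not operating is marginally preferred. If the evidence shifts slightly, say to $(0.6, 0.3, 0.1)$, the expected losses become $12 + 3 + 1 = 16$ for surgery and $0 + 15 + 6 = 21$ for no surgery, and the optimal action flips to surgery.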

It is natural to ask where these numbers come from. Ultimately they represent the personal 
preferences or values of a fictitious doctor, and are somewhat arbitrary: just as some people 
prefer chocolate ice cream and others prefer vanilla, there is no such thing as the “right” loss/utility
function. However, it can be shown (see e.g., (DeGroot 1970)) that any set of consistent
preferences can be converted to a scalar loss/utility function. Note that utility can be measured
on an arbitrary scale, such as dollars, since it is only relative values that matter.$^6$


6. People are often squeamish about talking about human lives in monetary terms, but all decision making requires 
Sequential decision theory


So far, we have concentrated on one-shot decision problems, where we only have to make
one decision and then the game ends. In Section 10.6, we will generalize this to multi-stage or
sequential decision problems. Such problems frequently arise in many business and engineering 
settings. This is closely related to the problem of reinforcement learning. However, further 
discussion of this point is beyond the scope of this book. 


Exercises 


Exercise 5.1 Proof that a mixture of conjugate priors is indeed conjugate 


Derive Equation 5.69. 


Exercise 5.2 Optimal threshold on classification probability 


Consider a case where we have learned a conditional probability distribution $P(y|x)$. Suppose there are
only two classes, and let $p_0 = P(Y = 0|x)$ and $p_1 = P(Y = 1|x)$. Consider the loss matrix below:


predicted              true label $y$
label $\hat{y}$          0                  1
0                        0             $\lambda_{01}$
1                   $\lambda_{10}$          0


a. Show that the decision $\hat{y}$ that minimizes the expected loss is equivalent to setting a probability threshold
$\theta$ and predicting $\hat{y} = 0$ if $p_1 < \theta$ and $\hat{y} = 1$ if $p_1 \geq \theta$. What is $\theta$ as a function of $\lambda_{01}$ and $\lambda_{10}$? (Show
your work.)


b. Show a loss matrix where the threshold is 0.1. (Show your work.) 


Exercise 5.3 Reject option in classifiers 
(Source: (Duda et al. 2001, Q2.13).) 


In many classification problems one has the option either of assigning x to class j or, if you are too 
uncertain, of choosing the reject option. If the cost for rejects is less than the cost of falsely classifying 
the object, it may be the optimal action. Let $\alpha_i$ mean you choose action $i$, for $i = 1 : C + 1$, where $C$
is the number of classes and $C + 1$ is the reject action. Let $Y = j$ be the true (but unknown) state of
nature. Define the loss function as follows


$$\lambda(\alpha_i|Y = j) = \begin{cases} 0 & \text{if } i = j \text{ and } i, j \in \{1, \ldots, C\} \\ \lambda_r & \text{if } i = C + 1 \\ \lambda_s & \text{otherwise} \end{cases} \tag{5.122}$$


In other words, you incur 0 loss if you correctly classify, you incur $\lambda_r$ loss (cost) if you choose the reject
option, and you incur $\lambda_s$ loss (cost) if you make a substitution error (misclassification).


tradeoffs, and one needs to use some kind of “currency” to compare different courses of action. Insurance companies 
do this all the time. Ross Schachter, a decision theorist at Stanford University, likes to tell a story of a school board who 
rejected a study on asbestos removal from schools because it performed a cost-benefit analysis, which was considered
“inhumane” because they put a dollar value on children’s health; the result of rejecting the report was that the asbestos
was not removed, which is surely more “inhumane”. In medical domains, one often measures utility in terms of QALY, or 
quality-adjusted life-years, instead of dollars, but it’s the same idea. Of course, even if you do not explicitly specify how 
much you value different people's lives, your behavior will reveal your implicit values/ preferences, and these preferences 
can then be converted to a real-valued scale, such as dollars or QALY. Inferring a utility function from behavior is called 
inverse reinforcement learning. 
a. Show that the minimum risk is obtained if we decide $Y = j$ if $p(Y = j|x) \ge p(Y = k|x)$ for all $k$
(i.e., $j$ is the most probable class) and if $p(Y = j|x) \ge 1 - \lambda_r/\lambda_s$; otherwise we decide to reject.


b. Describe qualitatively what happens as $\lambda_r/\lambda_s$ is increased from 0 to 1 (i.e., the relative cost of rejection
increases).


Exercise 5.4 More reject options


In many applications, the classifier is allowed to “reject” a test example rather than classifying it into one
of the classes. Consider, for example, a case in which the cost of a misclassification is $10 but the cost of
having a human manually make the decision is only $3. We can formulate this as the following loss matrix:


Decision $\hat{y}$ | true label $y = 0$ | true label $y = 1$
predict 0 | 0 | 10
predict 1 | 10 | 0
reject | 3 | 3


a. Suppose $P(y = 1|x)$ is predicted to be 0.2. Which decision minimizes the expected loss?
b. Now suppose $P(y = 1|x) = 0.4$. Now which decision minimizes the expected loss?


c. Show that in general, for this loss matrix, but for any posterior distribution, there will be two thresholds
$\theta_0$ and $\theta_1$ such that the optimal decision is to predict 0 if $p_1 < \theta_0$, reject if $\theta_0 < p_1 < \theta_1$, and
predict 1 if $p_1 > \theta_1$ (where $p_1 = p(y = 1|x)$). What are these thresholds?
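
One way to check answers to Exercise 5.4 numerically is to compute the posterior expected loss of each action directly from the loss matrix. The sketch below is only an illustration: the names `LOSS`, `expected_losses` and `best_action` are hypothetical helpers, not part of the book's code.

```python
# Loss matrix of Exercise 5.4: each action maps to (loss if y = 0, loss if y = 1).
LOSS = {
    "predict 0": (0.0, 10.0),
    "predict 1": (10.0, 0.0),
    "reject": (3.0, 3.0),
}


def expected_losses(p1):
    """Posterior expected loss of every action when p(y = 1 | x) = p1."""
    p0 = 1.0 - p1
    return {action: l0 * p0 + l1 * p1 for action, (l0, l1) in LOSS.items()}


def best_action(p1):
    """Action with the smallest posterior expected loss."""
    losses = expected_losses(p1)
    return min(losses, key=losses.get)


if __name__ == "__main__":
    for p1 in (0.2, 0.4):
        print(p1, expected_losses(p1), best_action(p1))
```

Sweeping `p1` over a fine grid and recording where `best_action` changes gives a numerical estimate of the two thresholds asked for in part (c).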


Exercise 5.5 Newsvendor problem 


Consider the following classic problem in decision theory/ economics. Suppose you are trying to decide 
how much quantity Q of some product (e.g., newspapers) to buy to maximize your profits. The optimal 
amount will depend on how much demand D you think there is for your product, as well as its cost 
to you C and its selling price P. Suppose D is unknown but has pdf f(D) and cdf F(D). We can 
evaluate the expected profit by considering two cases: if $D > Q$, then we sell all $Q$ items, and make profit
$\pi = (P - C)Q$; but if $D < Q$, we only sell $D$ items, at profit $(P - C)D$, but have wasted $C(Q - D)$
on the unsold items. So the expected profit if we buy quantity $Q$ is


$$\mathbb{E}\pi(Q) = \int_Q^{\infty} (P - C) Q f(D)\, dD + \int_0^Q (P - C) D f(D)\, dD - \int_0^Q C (Q - D) f(D)\, dD \tag{5.123}$$


Simplify this expression, and then take derivatives wrt Q to show that the optimal quantity Q* (which 
maximizes the expected profit) satisfies 


$$F(Q^*) = \frac{P - C}{P} \tag{5.124}$$
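
As a sanity check on Equation 5.124, the expected profit can be evaluated numerically for a concrete demand distribution. The sketch below assumes, purely for illustration, that demand is uniform on [0, 100], with price P = 5 and cost C = 2; these numbers and the function name `expected_profit` are hypothetical choices, not from the exercise.

```python
def expected_profit(q, price, cost, d_max, steps=1000):
    """Midpoint-rule estimate of the expected profit for D ~ Uniform(0, d_max).

    The profit in both cases of the exercise equals price * min(D, q) - cost * q.
    """
    width = d_max / steps
    density = 1.0 / d_max
    total = 0.0
    for k in range(steps):
        d = (k + 0.5) * width
        profit = price * min(d, q) - cost * q
        total += profit * density * width
    return total


price, cost, d_max = 5.0, 2.0, 100.0
# Search over integer order quantities for the one with the highest expected profit.
best_q = max(range(0, 101), key=lambda q: expected_profit(q, price, cost, d_max))

if __name__ == "__main__":
    # For a uniform demand, F(Q) = Q / d_max, so F(best_q) should match (P - C) / P.
    print(best_q, best_q / d_max, (price - cost) / price)
```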


Exercise 5.6 Bayes factors and ROC curves 


Let $B = p(D|H_1)/p(D|H_0)$ be the Bayes factor in favor of model 1. Suppose we plot two ROC curves,
one computed by thresholding $B$, and the other computed by thresholding $p(H_1|D)$. Will they be the
same or different? Explain why.


Exercise 5.7 Bayes model averaging helps predictive accuracy 


Let $\Delta$ be a quantity that we want to predict, let $D$ be the observed data and $M$ be a finite set of models.
Suppose our action is to provide a probabilistic prediction $p(\cdot)$, and the loss function is $L(\Delta, p(\cdot)) =
-\log p(\Delta)$. We can either perform Bayes model averaging and predict using
$$p^{BMA}(\Delta) = \sum_{m \in M} p(\Delta|m, D)\, p(m|D) \tag{5.125}$$
or we could predict using any single model (a plugin approximation)
$$p^{m}(\Delta) = p(\Delta|m, D) \tag{5.126}$$


Show that, for all models $m \in M$, the posterior expected loss using BMA is lower, i.e.,
$$\mathbb{E}\left[L(\Delta, p^{BMA})\right] \le \mathbb{E}\left[L(\Delta, p^{m})\right] \tag{5.127}$$
where the expectation over $\Delta$ is with respect to


$$p(\Delta|D) = \sum_{m \in M} p(\Delta|m, D)\, p(m|D) \tag{5.128}$$


Hint: use the non-negativity of the KL divergence.


Exercise 5.8 MLE and model selection for a 2d discrete distribution 
(Source: Jaakkola.) 


Let $x \in \{0, 1\}$ denote the result of a coin toss ($x = 0$ for tails, $x = 1$ for heads). The coin is potentially
biased, so that heads occurs with probability $\theta_1$. Suppose that someone else observes the coin flip and
reports to you the outcome, $y$. But this person is unreliable and only reports the result correctly with
probability $\theta_2$; i.e., $p(y|x, \theta_2)$ is given by


      | y=0              | y=1
x=0   | $\theta_2$       | $1 - \theta_2$
x=1   | $1 - \theta_2$   | $\theta_2$


Assume that $\theta_2$ is independent of $x$ and $\theta_1$.


a. Write down the joint probability distribution $p(x, y|\theta)$ as a 2 x 2 table, in terms of $\theta = (\theta_1, \theta_2)$.
b. Suppose we have the following dataset: x = (1,1,0,1,1,0,0), y = (1,0,0,0,1,0,1). What are the
MLEs for $\theta_1$ and $\theta_2$? Justify your answer. Hint: note that the likelihood function factorizes,
$$p(x, y|\theta) = p(y|x, \theta_2)\, p(x|\theta_1) \tag{5.129}$$
What is $p(D|\hat{\theta}, M_2)$ where $M_2$ denotes this 2-parameter model? (You may leave your answer in
fractional form if you wish.)


c. Now consider a model with 4 parameters, $\theta = (\theta_{0,0}, \theta_{0,1}, \theta_{1,0}, \theta_{1,1})$, representing $p(x, y|\theta) = \theta_{x,y}$.
(Only 3 of these parameters are free to vary, since they must sum to one.) What is the MLE of $\theta$? What
is $p(D|\hat{\theta}, M_4)$ where $M_4$ denotes this 4-parameter model?


d. Suppose we are not sure which model is correct. We compute the leave-one-out cross validated log
likelihood of the 2-parameter model and the 4-parameter model as follows:


$$L(m) = \sum_{i=1}^{n} \log p(x_i, y_i|m, \hat{\theta}(D_{-i})) \tag{5.130}$$


and $\hat{\theta}(D_{-i})$ denotes the MLE computed on $D$ excluding row $i$. Which model will CV pick and why?
Hint: notice how the table of counts changes when you omit each training case one at a time.
e. Recall that an alternative to CV is to use the BIC score, defined as


$$\mathrm{BIC}(M, D) \triangleq \log p(D|\hat{\theta}_{MLE}) - \frac{\mathrm{dof}(M)}{2} \log N \tag{5.131}$$


where dof(M) is the number of free parameters in the model. Compute the BIC scores for both models
(use log base e). Which model does BIC prefer?


Exercise 5.9 Posterior median is optimal estimate under L1 loss 


Prove that the posterior median is the optimal estimate under L1 loss.


Exercise 5.10 Decision rule for trading off FPs and FNs 
If $L_{FN} = c L_{FP}$, show that we should pick $\hat{y} = 1$ iff $p(y = 1|x)/p(y = 0|x) > \tau$, where $\tau = c/(1 + c)$




Frequentist statistics 


Introduction 


The approach to statistical inference that we described in Chapter 5 is known as Bayesian 
statistics. Perhaps surprisingly, this is considered controversial by some people, whereas the
application of Bayes rule to non-statistical problems, such as medical diagnosis (Section 2.2.3.1),
spam filtering (Section 3.4.4.1), or airplane tracking (Section 18.2.1), is not controversial. The
reason for the objection has to do with a misguided distinction between parameters of a
statistical model and other kinds of unknown quantities.¹

Attempts have been made to devise approaches to statistical inference that avoid treating 
parameters like random variables, and which thus avoid the use of priors and Bayes rule. Such 
approaches are known as frequentist statistics, classical statistics or orthodox statistics. 
Instead of being based on the posterior distribution, they are based on the concept of a sampling 
distribution. This is the distribution that an estimator has when applied to multiple data sets 
sampled from the true but unknown distribution; see Section 6.2 for details. It is this notion 
of variation across repeated trials that forms the basis for modeling uncertainty used by the 
frequentist approach. 

By contrast, in the Bayesian approach, we only ever condition on the actually observed data; 
there is no notion of repeated trials. This allows the Bayesian to compute the probability of 
one-off events, as we discussed in Section 2.1. Perhaps more importantly, the Bayesian approach 
avoids certain paradoxes that plague the frequentist approach (see Section 6.6). Nevertheless, it 
is important to be familiar with frequentist statistics (especially Section 6.5), since it is widely 
used in machine learning. 


Sampling distribution of an estimator 


In frequentist statistics, a parameter estimate $\hat{\theta}$ is computed by applying an estimator $\delta$ to
some data $D$, so $\hat{\theta} = \delta(D)$. The parameter is viewed as fixed and the data as random, which
is the exact opposite of the Bayesian approach. The uncertainty in the parameter estimate can 
be measured by computing the sampling distribution of the estimator. To understand this 


1. Parameters are sometimes considered to represent true (but unknown) physical quantities, which are therefore not
random. However, we have seen that it is perfectly reasonable to use a probability distribution to represent one’s 
uncertainty about an unknown constant. 
Figure 6.1 A bootstrap approximation to the sampling distribution of $\hat{\theta}$ for a Bernoulli distribution. We
use $B = 10{,}000$ bootstrap samples. The $N$ datacases were generated from $\mathrm{Ber}(\theta = 0.7)$. (a) MLE with
$N = 10$ (panel title: true = 0.70, n=10, mle = 0.90, se = 0.001). (b) MLE with $N = 100$ (panel title: true = 0.70,
n=100, mle = 0.70, se = 0.000). Figure generated by bootstrapDemoBer.


concept, imagine sampling many different data sets $D^{(s)}$ from some true model, $p(\cdot|\theta^*)$, i.e., let
$D^{(s)} = \{x_i^{(s)}\}_{i=1}^{N}$, where $x_i^{(s)} \sim p(\cdot|\theta^*)$, and $\theta^*$ is the true parameter. Here $s = 1 : S$ indexes
the sampled data set, and $N$ is the size of each such dataset. Now apply the estimator $\hat{\theta}(\cdot)$
to each $D^{(s)}$ to get a set of estimates, $\{\hat{\theta}(D^{(s)})\}$. As we let $S \to \infty$, the distribution induced
on $\hat{\theta}(\cdot)$ is the sampling distribution of the estimator. We will discuss various ways to use the
sampling distribution in later sections. But first we sketch two approaches for computing the
sampling distribution itself.


Bootstrap 


The bootstrap is a simple Monte Carlo technique to approximate the sampling distribution. This 
is particularly useful in cases where the estimator is a complex function of the true parameters. 

The idea is simple. If we knew the true parameters $\theta^*$, we could generate many (say $S$) fake
datasets, each of size $N$, from the true distribution, $x_i^s \sim p(\cdot|\theta^*)$, for $s = 1 : S$, $i = 1 : N$.
We could then compute our estimator from each sample, $\hat{\theta}^s = f(x_{1:N}^s)$, and use the empirical
distribution of the resulting samples as our estimate of the sampling distribution. Since $\theta^*$ is
unknown, the idea of the parametric bootstrap is to generate the samples using $\hat{\theta}(D)$ instead.
An alternative, called the non-parametric bootstrap, is to sample the $x_i^s$ (with replacement)
from the original data $D$, and then compute the induced distribution as before. Some methods
for speeding up the bootstrap when applied to massive data sets are discussed in (Kleiner et al.
2011).
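
The two bootstrap variants just described can be sketched in a few lines for the Bernoulli case. This is a minimal illustration using only the Python standard library, not the book's bootstrapDemoBer; the function names, the seed, and the sample sizes are assumptions made for the example.

```python
import random
import statistics


def mle_bernoulli(xs):
    """MLE of the Bernoulli parameter: the fraction of ones."""
    return sum(xs) / len(xs)


def parametric_bootstrap(data, num_samples, rng):
    """Resample fake datasets from Ber(theta_hat), where theta_hat is the MLE on data."""
    theta_hat = mle_bernoulli(data)
    n = len(data)
    estimates = []
    for _ in range(num_samples):
        fake = [1 if rng.random() < theta_hat else 0 for _ in range(n)]
        estimates.append(mle_bernoulli(fake))
    return estimates


def nonparametric_bootstrap(data, num_samples, rng):
    """Resample the observed data points themselves, with replacement."""
    n = len(data)
    return [mle_bernoulli(rng.choices(data, k=n)) for _ in range(num_samples)]


rng = random.Random(0)
data = [1 if rng.random() < 0.7 else 0 for _ in range(100)]  # N = 100 draws from Ber(0.7)
par = parametric_bootstrap(data, 2000, rng)
non = nonparametric_bootstrap(data, 2000, rng)

if __name__ == "__main__":
    print("MLE on original data:", mle_bernoulli(data))
    print("parametric     mean/sd:", statistics.mean(par), statistics.stdev(par))
    print("non-parametric mean/sd:", statistics.mean(non), statistics.stdev(non))
```

The spread of `par` (or `non`) serves as an estimate of the standard error of the MLE, and a histogram of either list approximates the sampling distribution.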

Figure 6.1 shows an example where we compute the sampling distribution of the MLE for 
a Bernoulli using the parametric bootstrap. (Results using the non-parametric bootstrap are 
essentially the same.) We see that the sampling distribution is asymmetric, and therefore quite 
far from Gaussian, when N = 10; when N = 100, the distribution looks more Gaussian, as 
theory suggests (see below). 

A natural question is: what is the connection between the parameter estimates $\hat{\theta}^s = \hat{\theta}(x_{1:N}^s)$
computed by the bootstrap and parameter values sampled from the posterior, $\theta^s \sim p(\cdot|D)$?
Conceptually they are quite different. But in the common case that the prior is not very
strong, they can be quite similar. For example, Figure 6.1(c-d) shows an example where we
compute the posterior using a uniform Beta(1,1) prior, and then sample from it. We see that
the posterior and the sampling distribution are quite similar. So one can think of the bootstrap
distribution as a “poor man’s” posterior; see (Hastie et al. 2001, p235) for details.

However, perhaps surprisingly, bootstrap can be slower than posterior sampling. The reason 
is that the bootstrap has to fit the model S times, whereas in posterior sampling, we usually 
only fit the model once (to find a local mode), and then perform local exploration around the 
mode. Such local exploration is usually much faster than fitting a model from scratch. 


Large sample theory for the MLE * 


In some cases, the sampling distribution for some estimators can be computed analytically. In 
particular, it can be shown that, under certain conditions, as the sample size tends to infinity, 
the sampling distribution of the MLE becomes Gaussian. Informally, the requirement for this 
result to hold is that each parameter in the model gets to “see” an infinite amount of data, and 
that the model be identifiable. Unfortunately this excludes many of the models of interest to 
machine learning. Nevertheless, let us assume we are in a simple setting where the theorem 
holds. 

The center of the Gaussian will be the MLE $\hat{\theta}$. But what about the variance of this Gaussian?
Intuitively the variance of the estimator will be (inversely) related to the amount of curvature of 
the likelihood surface at its peak. If the curvature is large, the peak will be “sharp”, and the 
variance low; in this case, the estimate is “well determined”. By contrast, if the curvature is 
small, the peak will be nearly “flat”, so the variance is high. 

Let us now formalize this intuition. Define the score function as the gradient of the log
likelihood evaluated at some point $\hat{\theta}$:


$$s(\hat{\theta}) \triangleq \nabla \log p(D|\theta)\big|_{\hat{\theta}} \tag{6.1}$$

Define the observed information matrix as the gradient of the negative score function, or
equivalently, the Hessian of the NLL:

$$J(\hat{\theta}(D)) \triangleq -\nabla s(\hat{\theta}) = -\nabla^2_{\theta} \log p(D|\theta)\big|_{\hat{\theta}} \tag{6.2}$$

In 1D, this becomes


$$J(\hat{\theta}(D)) = -\frac{d^2}{d\theta^2} \log p(D|\theta)\Big|_{\hat{\theta}} \tag{6.3}$$


This is just a measure of curvature of the log-likelihood function at $\hat{\theta}$.

Since we are studying the sampling distribution, $D = (x_1, \ldots, x_N)$ is a set of random
variables. The Fisher information matrix is defined to be the expected value of the observed
information matrix:²


$$I_N(\hat{\theta}|\theta^*) \triangleq \mathbb{E}_{\theta^*}\left[J(\hat{\theta}|D)\right] \tag{6.4}$$


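2. This is not the usual definition, but is equivalent to it under standard assumptions. More precisely, the standard
definition is as follows (we just give the scalar case to simplify notation): $I(\hat{\theta}|\theta^*) \triangleq \mathrm{var}_{\theta^*}\left[\frac{d}{d\theta} \log p(X|\theta)\big|_{\hat{\theta}}\right]$, that
is, the variance of the score function. If $\hat{\theta}$ is the MLE, it is easy to see that $\mathbb{E}_{\theta^*}\left[\frac{d}{d\theta} \log p(X|\theta)\big|_{\hat{\theta}}\right] = 0$ (since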
where $\mathbb{E}_{\theta^*}[f(D)] \triangleq \frac{1}{N} \sum_{i=1}^{N} f(x_i)\, p(x_i|\theta^*)$ is the expected value of the function $f$ when
applied to data sampled from $\theta^*$. Often $\theta^*$, representing the “true parameter” that generated
the data, is assumed known, so we just write $I_N(\hat{\theta}) \triangleq I_N(\hat{\theta}|\theta^*)$ for short. Furthermore, it is
easy to see that $I_N(\hat{\theta}) = N I_1(\hat{\theta})$, because the log-likelihood for a sample of size $N$ is just $N$
times “steeper” than the log-likelihood for a sample of size 1. So we can drop the 1 subscript
and just write $I(\hat{\theta}) \triangleq I_1(\hat{\theta})$. This is the notation that is usually used.


Now let $\hat{\theta} \triangleq \hat{\theta}_{mle}(D)$ be the MLE, where $D \sim \theta^*$. It can be shown that


$$\hat{\theta} \to \mathcal{N}(\theta^*, I_N(\theta^*)^{-1}) \tag{6.5}$$


as $N \to \infty$ (see e.g., (Rice 1995, p265) for a proof). We say that the sampling distribution of the
MLE is asymptotically normal.

What about the variance of the MLE, which can be used as some measure of confidence
in the MLE? Unfortunately, $\theta^*$ is unknown, so we can't evaluate the variance of the sampling
distribution. However, we can approximate the sampling distribution by replacing $\theta^*$ with $\hat{\theta}$.
Consequently, the approximate standard errors of $\hat{\theta}_k$ are given by


$$\hat{se}_k \triangleq I_N(\hat{\theta})_{kk}^{-\frac{1}{2}} \tag{6.6}$$


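For example, from Equation 5.60 we know that the Fisher information for a binomial sampling
model is


$$I(\theta) = \frac{1}{\theta(1 - \theta)} \tag{6.7}$$

So the approximate standard error of the MLE is

$$\hat{se} = \frac{1}{\sqrt{I_N(\hat{\theta})}} = \frac{1}{\sqrt{N I(\hat{\theta})}} = \left(\frac{\hat{\theta}(1 - \hat{\theta})}{N}\right)^{\frac{1}{2}} \tag{6.8}$$

where $\hat{\theta} = \frac{1}{N} \sum_i X_i$. Compare this to Equation 3.27, which is the posterior standard deviation
under a uniform prior.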
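
To see how well the asymptotic standard error of Equation 6.8 matches the actual spread of the MLE, one can simulate the sampling distribution directly. The sketch below is an illustration only; the helper name `se_bernoulli_mle`, the seed, and the settings (true parameter 0.7, N = 100, 5000 repetitions) are assumptions chosen for the example.

```python
import math
import random
import statistics


def se_bernoulli_mle(theta_hat, n):
    """Approximate standard error from Equation 6.8: sqrt(theta_hat (1 - theta_hat) / N)."""
    return math.sqrt(theta_hat * (1.0 - theta_hat) / n)


rng = random.Random(1)
theta_star, n, trials = 0.7, 100, 5000

# Draw many datasets from the true model and record the MLE of each one.
estimates = []
for _ in range(trials):
    heads = sum(1 for _ in range(n) if rng.random() < theta_star)
    estimates.append(heads / n)

empirical_sd = statistics.pstdev(estimates)

if __name__ == "__main__":
    print("simulated sd of the MLE:", empirical_sd)
    print("formula at theta*:      ", se_bernoulli_mle(theta_star, n))
```

In practice $\theta^*$ is unknown, so one would plug the MLE from the single observed dataset into `se_bernoulli_mle`, as the text describes.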


Frequentist decision theory 


In frequentist or classical decision theory, there is a loss function and a likelihood, but there is 
no prior and hence no posterior or posterior expected loss. Thus there is no automatic way of 
deriving an optimal estimator, unlike the Bayesian case. Instead, in the frequentist approach, we 
are free to choose any estimator or decision procedure $\delta : \mathcal{X} \to \mathcal{A}$ we want.³


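the gradient must be zero at a maximum), so the variance reduces to the expected square of the score function:
$I(\hat{\theta}|\theta^*) = \mathbb{E}_{\theta^*}\left[\left(\frac{d}{d\theta} \log p(X|\theta)\right)^2\right]$. It can be shown (e.g., (Rice 1995, p263)) that $\mathbb{E}_{\theta^*}\left[\left(\frac{d}{d\theta} \log p(X|\theta)\right)^2\right] =
-\mathbb{E}_{\theta^*}\left[\frac{d^2}{d\theta^2} \log p(X|\theta)\right]$, so now the Fisher information reduces to the expected second derivative of the NLL, which
is a much more intuitive quantity than the variance of the score.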

3. In practice, the frequentist approach is usually only applied to one-shot statistical decision problems — such as 
classification, regression and parameter estimation — since its non-constructive nature makes it difficult to apply to 
sequential decision problems, which adapt to data online. 
Having chosen an estimator, we define its expected loss or risk as follows:


$$R(\theta^*, \delta) \triangleq \mathbb{E}_{p(\tilde{D}|\theta^*)}\left[L(\theta^*, \delta(\tilde{D}))\right] = \int L(\theta^*, \delta(\tilde{D}))\, p(\tilde{D}|\theta^*)\, d\tilde{D} \tag{6.9}$$


where $\tilde{D}$ is data sampled from “nature's distribution”, which is represented by parameter $\theta^*$. In
other words, the expectation is wrt the sampling distribution of the estimator. Compare this to
the Bayesian posterior expected loss:


$$\rho(a|D, \pi) \triangleq \mathbb{E}_{p(\theta|D, \pi)}\left[L(\theta, a)\right] = \int_{\Theta} L(\theta, a)\, p(\theta|D, \pi)\, d\theta \tag{6.10}$$


We see that the Bayesian approach averages over $\theta$ (which is unknown) and conditions on $D$
(which is known), whereas the frequentist approach averages over $\tilde{D}$ (thus ignoring the observed
data), and conditions on $\theta^*$ (which is unknown).

Not only is the frequentist definition unnatural, it cannot even be computed, because $\theta^*$ is
unknown. Consequently, we cannot compare different estimators in terms of their frequentist
risk. We discuss various solutions to this below.
Bayes risk 


How do we choose amongst estimators? We need some way to convert $R(\theta^*, \delta)$ into a single
measure of quality, $R(\delta)$, which does not depend on knowing $\theta^*$. One approach is to put a
prior on $\theta^*$, and then to define Bayes risk or integrated risk of an estimator as follows:


$$R_B(\delta) \triangleq \mathbb{E}_{p(\theta^*)}\left[R(\theta^*, \delta)\right] = \int R(\theta^*, \delta)\, p(\theta^*)\, d\theta^* \tag{6.11}$$


A Bayes estimator or Bayes decision rule is one which minimizes the expected risk:
$$\delta_B \triangleq \operatorname*{argmin}_{\delta} R_B(\delta) \tag{6.12}$$
Note that the integrated risk is also called the preposterior risk, since it is before we have seen
the data. Minimizing this can be useful for experiment design.


We will now prove a very important theorem, that connects the Bayesian and frequentist 
approaches to decision theory. 


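Theorem 6.3.1. A Bayes estimator can be obtained by minimizing the posterior expected loss for
each $x$.


Proof. By switching the order of integration, we have
$$R_B(\delta) = \int \left[\sum_{x} \sum_{y} L(y, \delta(x))\, p(x, y|\theta^*)\right] p(\theta^*)\, d\theta^* \tag{6.13}$$
$$= \sum_{x} \sum_{y} \int_{\Theta} L(y, \delta(x))\, p(x, y, \theta^*)\, d\theta^* \tag{6.14}$$
$$= \sum_{x} \left[\sum_{y} L(y, \delta(x))\, p(y|x)\right] p(x) \tag{6.15}$$
$$= \sum_{x} \rho(\delta(x)|x)\, p(x) \tag{6.16}$$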
Figure 6.2 Risk functions for two decision procedures, $\delta_1$ and $\delta_2$. Since $\delta_1$ has lower worst case risk, it
is the minimax estimator, even though $\delta_2$ has lower risk for most values of $\theta$. Thus minimax estimators
are overly conservative.


To minimize the overall expectation, we just minimize the term inside for each $x$, so our decision
rule is to pick


$$\delta_B(x) = \operatorname*{argmin}_{a \in \mathcal{A}} \rho(a|x) \tag{6.17}$$


Hence we see that picking the optimal action on a case-by-case basis (as in the Bayesian
approach) is optimal on average (as in the frequentist approach). In other words, the Bayesian
approach provides a good way of achieving frequentist goals. In fact, one can go further and
prove the following.


Theorem 6.3.2 (Wald, 1950). Every admissible decision rule is a Bayes decision rule with respect
to some, possibly improper, prior distribution.


This theorem shows that the best way to minimize frequentist risk is to be Bayesian! See 
(Bernardo and Smith 1994, p448) for further discussion of this point. 
Minimax risk 


Obviously some frequentists dislike using Bayes risk since it requires the choice of a prior
(although this is only in the evaluation of the estimator, not necessarily as part of its construction).
An alternative approach is as follows. Define the maximum risk of an estimator as


$$R_{max}(\delta) \triangleq \max_{\theta^*} R(\theta^*, \delta) \tag{6.18}$$


A minimax rule is one which minimizes the maximum risk:


$$\delta_{MM} \triangleq \operatorname*{argmin}_{\delta} R_{max}(\delta) \tag{6.19}$$
For example, in Figure 6.2, we see that $\delta_1$ has lower worst-case risk than $\delta_2$, ranging over all
possible values of $\theta^*$, so it is the minimax estimator (see Section 6.3.3.1 for an explanation of
how to compute a risk function for an actual model).

Minimax estimators have a certain appeal. However, computing them can be hard. And 
furthermore, they are very pessimistic. In fact, one can show that all minimax estimators 
are equivalent to Bayes estimators under a least favorable prior. In most statistical situations 
(excluding game theoretic ones), assuming nature is an adversary is not a reasonable assumption. 


Admissible estimators 


The basic problem with frequentist decision theory is that it relies on knowing the true
distribution $p(\cdot|\theta^*)$ in order to evaluate the risk. However, it might be the case that some estimators
are worse than others regardless of the value of $\theta^*$. In particular, if $R(\theta, \delta_1) \le R(\theta, \delta_2)$ for all
$\theta \in \Theta$, then we say that $\delta_1$ dominates $\delta_2$. The domination is said to be strict if the inequality
is strict for some $\theta$. An estimator is said to be admissible if it is not strictly dominated by any
other estimator.


Example 


Let us give an example, based on (Bernardo and Smith 1994). Consider the problem of estimating
the mean of a Gaussian. We assume the data is sampled from $x_i \sim \mathcal{N}(\theta^*, \sigma^2 = 1)$ and use
quadratic loss, $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$. The corresponding risk function is the MSE. Some possible
decision rules or estimators $\hat{\theta}(x) = \delta(x)$ are as follows:


• $\delta_1(x) = \bar{x}$, the sample mean
• $\delta_2(x) = \tilde{x}$, the sample median
• $\delta_3(x) = \theta_0$, a fixed value
• $\delta_\kappa(x)$, the posterior mean under a $\mathcal{N}(\theta|\theta_0, \sigma^2/\kappa)$ prior:


$$\delta_\kappa(x) = \frac{N}{N + \kappa} \bar{x} + \frac{\kappa}{N + \kappa} \theta_0 = w \bar{x} + (1 - w) \theta_0 \tag{6.20}$$


For $\delta_\kappa$, we consider a weak prior, $\kappa = 1$, and a stronger prior, $\kappa = 5$. The prior mean is $\theta_0$,
some fixed value. We assume $\sigma^2$ is known. (Thus $\delta_3(x)$ is the same as $\delta_\kappa(x)$ with an infinitely
strong prior, $\kappa = \infty$.)

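Let us now derive the risk functions analytically. (We can do this since in this toy example,
we know the true parameter $\theta^*$.) In Section 6.4.4, we show that the MSE can be decomposed
into squared bias plus variance:


$$\mathrm{MSE}(\hat{\theta}(\cdot)|\theta^*) = \mathrm{var}\left[\hat{\theta}\right] + \mathrm{bias}^2(\hat{\theta}) \tag{6.21}$$
The sample mean is unbiased, so its risk is


$$\mathrm{MSE}(\delta_1|\theta^*) = \mathrm{var}\left[\bar{x}\right] = \frac{\sigma^2}{N} \tag{6.22}$$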
Figure 6.3 Risk functions for estimating the mean of a Gaussian using data sampled from $\mathcal{N}(\theta^*, \sigma^2 = 1)$.
The solid dark blue horizontal line is the MLE, the solid light blue curved line is the posterior mean when
$\kappa = 5$. Left: $N = 5$ samples. Right: $N = 20$ samples. Based on Figure B.1 of (Bernardo and Smith 1994).
Figure generated by riskFnGauss.


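The sample median is also unbiased. One can show that the variance is approximately $\pi/(2N)$,
so
$$\mathrm{MSE}(\delta_2|\theta^*) = \frac{\pi}{2N} \tag{6.23}$$
For $\delta_3(x) = \theta_0$, the variance is zero, so
$$\mathrm{MSE}(\delta_3|\theta^*) = (\theta^* - \theta_0)^2 \tag{6.24}$$
Finally, for the posterior mean, we have
$$\mathrm{MSE}(\delta_\kappa|\theta^*) = \mathbb{E}\left[(w \bar{x} + (1 - w)\theta_0 - \theta^*)^2\right] \tag{6.25}$$
$$= \mathbb{E}\left[\left(w(\bar{x} - \theta^*) + (1 - w)(\theta_0 - \theta^*)\right)^2\right] \tag{6.26}$$
$$= w^2 \frac{\sigma^2}{N} + (1 - w)^2 (\theta_0 - \theta^*)^2 \tag{6.27}$$
$$= \frac{1}{(N + \kappa)^2} \left(N \sigma^2 + \kappa^2 (\theta_0 - \theta^*)^2\right) \tag{6.28}$$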

These functions are plotted in Figure 6.3 for $N \in \{5, 20\}$. We see that in general, the best
estimator depends on the value of $\theta^*$, which is unknown. If $\theta^*$ is very close to $\theta_0$, then $\delta_3$
(which just predicts $\theta_0$) is best. If $\theta^*$ is within some reasonable range around $\theta_0$, then the
posterior mean, which combines the prior guess of $\theta_0$ with the actual data, is best. If $\theta^*$ is far
from $\theta_0$, the MLE is best. None of this should be surprising: a small amount of shrinkage (using
the posterior mean with a weak prior) is usually desirable, assuming our prior mean is sensible.
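
The risk functions in Equations 6.22 to 6.28 are simple enough to evaluate directly. The following sketch is not the book's riskFnGauss; it is a small illustration in which the function names are hypothetical and the settings ($\sigma^2 = 1$, $\theta_0 = 0$, $N = 5$) are assumed for the example.

```python
import math


def risk_mean(n, sigma2=1.0):
    """Equation 6.22: MSE of the sample mean."""
    return sigma2 / n


def risk_median(n):
    """Equation 6.23: approximate MSE of the sample median (sigma^2 = 1)."""
    return math.pi / (2 * n)


def risk_fixed(theta_star, theta0):
    """Equation 6.24: MSE of always guessing theta0."""
    return (theta_star - theta0) ** 2


def risk_postmean(theta_star, n, kappa, theta0, sigma2=1.0):
    """Equation 6.28: MSE of the posterior mean with prior strength kappa."""
    return (n * sigma2 + kappa ** 2 * (theta0 - theta_star) ** 2) / (n + kappa) ** 2


def best_estimator(theta_star, n=5, theta0=0.0):
    """Name of the estimator with the lowest risk at a given true parameter."""
    risks = {
        "mean": risk_mean(n),
        "median": risk_median(n),
        "fixed": risk_fixed(theta_star, theta0),
        "postmean kappa=1": risk_postmean(theta_star, n, 1, theta0),
        "postmean kappa=5": risk_postmean(theta_star, n, 5, theta0),
    }
    return min(risks, key=risks.get)


if __name__ == "__main__":
    for theta_star in (0.0, 0.5, 2.0):
        print(theta_star, best_estimator(theta_star))
```

Evaluating `best_estimator` at a true parameter equal to, near, and far from $\theta_0$ shows the winner changing with $\theta^*$, which is the point the text makes about Figure 6.3.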

What is more surprising is that the risk of decision rule $\delta_2$ (sample median) is always higher
than that of $\delta_1$ (sample mean) for every value of $\theta^*$. Consequently the sample median is an
inadmissible estimator for this particular problem (where the data is assumed to come from a
Gaussian).

In practice, the sample median is often better than the sample mean, because it is more 
robust to outliers. One can show (Minka 2000d) that the median is the Bayes estimator (under 
squared loss) if we assume the data comes from a Laplace distribution, which has heavier tails 
than a Gaussian. More generally, we can construct robust estimators by using flexible models 
of our data, such as mixture models or non-parametric density estimators (Section 14.7.2), and 
then computing the posterior mean or median. 


Stein’s paradox * 


Suppose we have $N$ iid random variables $X_i \sim \mathcal{N}(\theta_i, 1)$, and we want to estimate the $\theta_i$. The
obvious estimator is the MLE, which in this case sets $\hat{\theta}_i = x_i$. It turns out that this is an
inadmissible estimator under quadratic loss, when $N \ge 4$.

To show this, it suffices to construct an estimator that is better. The James-Stein estimator is
one such estimator, and is defined as follows:


$$\hat{\theta}_i = B \bar{x} + (1 - B) x_i = \bar{x} + (1 - B)(x_i - \bar{x}) \tag{6.29}$$


where $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$ and $0 < B < 1$ is some tuning constant. This estimate “shrinks” the
$\theta_i$ towards the overall mean. (We derive this estimator using an empirical Bayes approach in
Section 5.6.2.)

It can be shown that this shrinkage estimator has lower frequentist risk (MSE) than the MLE
(sample mean) for $N \ge 4$. This is known as Stein’s paradox. The reason it is called a paradox
is illustrated by the following example. Suppose $\theta_i$ is the “true” IQ of student $i$ and $X_i$ is his
test score. Why should my estimate of $\theta_i$ depend on the global mean $\bar{x}$, and hence on some
other student’s scores? One can create even more paradoxical examples by making the different
dimensions be qualitatively different, e.g., $\theta_1$ is my IQ, $\theta_2$ is the average rainfall in Vancouver,
etc.

The solution to the paradox is the following. If your goal is to estimate just $\theta_i$, you cannot do
better than using $x_i$, but if the goal is to estimate the whole vector $\theta$, and you use squared error
as your loss function, then shrinkage helps. To see this, suppose we want to estimate $||\theta||_2^2$ from
a single sample $x \sim \mathcal{N}(\theta, I)$. A simple estimate is $||x||_2^2$, but this will overestimate the result,
since


$$\mathbb{E}\left[||x||_2^2\right] = \mathbb{E}\left[\sum_i x_i^2\right] = \sum_{i=1}^{N} (1 + \theta_i^2) = N + ||\theta||_2^2 \tag{6.30}$$


Consequently we can reduce our risk by pooling information, even from unrelated sources, and 
shrinking towards the overall mean. In Section 5.6.2, we give a Bayesian explanation for this. 
See also (Efron and Morris 1975). 
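
A small simulation makes the risk reduction concrete. The sketch below shrinks towards the overall mean as in Equation 6.29. The text leaves $B$ as a tuning constant; the data-driven choice $B = \min(1, (N - 3)/\sum_i (x_i - \bar{x})^2)$ used here is an assumption for the example (a common form for unit-variance data), and the true means, seed, and function names are likewise hypothetical.

```python
import random


def shrink_to_mean(x, b):
    """Equation 6.29: move every coordinate a fraction b of the way to the overall mean."""
    xbar = sum(x) / len(x)
    return [xbar + (1.0 - b) * (xi - xbar) for xi in x]


def total_squared_error(estimate, theta):
    """Squared error summed over all coordinates of the vector."""
    return sum((e - t) ** 2 for e, t in zip(estimate, theta))


rng = random.Random(0)
theta = [0.1 * i for i in range(20)]  # hypothetical true means, N = 20
trials = 2000

mle_risk = 0.0
js_risk = 0.0
for _ in range(trials):
    x = [rng.gauss(t, 1.0) for t in theta]  # one observation per coordinate
    xbar = sum(x) / len(x)
    spread = sum((xi - xbar) ** 2 for xi in x)
    b = min(1.0, (len(x) - 3) / spread)  # assumed data-driven shrinkage factor
    mle_risk += total_squared_error(x, theta)
    js_risk += total_squared_error(shrink_to_mean(x, b), theta)

mle_risk /= trials
js_risk /= trials

if __name__ == "__main__":
    print("average total squared error, MLE:      ", mle_risk)
    print("average total squared error, shrinkage:", js_risk)
```

The comparison is on the total squared error of the whole vector; for any single coordinate the shrinkage estimate can be worse than $x_i$, which is consistent with the resolution of the paradox given above.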


Admissibility is not enough 


It seems clear that we can restrict our search for good estimators to the class of admissible 
estimators. But in fact it is easy to construct admissible estimators, as we show in the following 
example. 


Theorem 6.3.3. Let $X \sim \mathcal{N}(\theta, 1)$, and consider estimating $\theta$ under squared loss. Let $\delta_1(x) = \theta_0$,
a constant independent of the data. This is an admissible estimator.


Proof. Suppose not. Then there is some other estimator $\delta_2$ with smaller risk, so $R(\theta^*, \delta_2) \le
R(\theta^*, \delta_1)$, where the inequality must be strict for some $\theta^*$. Suppose the true parameter is
$\theta^* = \theta_0$. Then $R(\theta^*, \delta_1) = 0$, and


$$R(\theta^*, \delta_2) = \int (\delta_2(x) - \theta_0)^2\, p(x|\theta_0)\, dx \tag{6.31}$$


Since $0 \le R(\theta^*, \delta_2) \le R(\theta^*, \delta_1)$ for all $\theta^*$, and $R(\theta_0, \delta_1) = 0$, we have $R(\theta_0, \delta_2) = 0$ and
hence $\delta_2(x) = \theta_0 = \delta_1(x)$. Thus the only way $\delta_2$ can avoid having higher risk than $\delta_1$ at some
specific point $\theta_0$ is by being equal to $\delta_1$. Hence there is no other estimator $\delta_2$ with strictly lower
risk, so $\delta_1$ is admissible.


Desirable properties of estimators 


Since frequentist decision theory does not provide an automatic way to choose the best estimator, 
we need to come up with other heuristics for choosing amongst them. In this section, we discuss 
some properties we would like estimators to have. Unfortunately, we will see that we cannot 
achieve all of these properties at the same time. 


Consistent estimators 


An estimator is said to be consistent if it eventually recovers the true parameters that generated
the data as the sample size goes to infinity, i.e., $\hat{\theta}(D) \to \theta^*$ as $|D| \to \infty$ (where the arrow
denotes convergence in probability). Of course, this concept only makes sense if the data actually
comes from the specified model with parameters $\theta^*$, which is not usually the case with real
data. Nevertheless, it can be a useful theoretical property.

It can be shown that the MLE is a consistent estimator. The intuitive reason is that
maximizing likelihood is equivalent to minimizing $\mathrm{KL}\left(p(\cdot|\theta^*)\,||\,p(\cdot|\hat{\theta})\right)$, where $p(\cdot|\theta^*)$ is the true
distribution and $p(\cdot|\hat{\theta})$ is our estimate. We can achieve 0 KL divergence iff $\hat{\theta} = \theta^*$.⁴


Unbiased estimators 
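The bias of an estimator is defined as
$$\mathrm{bias}(\hat{\theta}(\cdot)) = \mathbb{E}_{p(D|\theta^*)}\left[\hat{\theta}(D) - \theta^*\right] \tag{6.32}$$


where $\theta^*$ is the true parameter value. If the bias is zero, the estimator is called unbiased. This
means the sampling distribution is centered on the true parameter. For example, the MLE for a
Gaussian mean is unbiased:


$$\mathrm{bias}(\hat{\mu}) = \mathbb{E}\left[\bar{x}\right] - \mu = \mathbb{E}\left[\frac{1}{N} \sum_{i=1}^{N} x_i\right] - \mu = \frac{N\mu}{N} - \mu = 0 \tag{6.33}$$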


4. If the model is unidentifiable, the MLE may select a set of parameters that is different from the true parameters
but for which the induced distribution, $p(\cdot|\hat{\theta})$, is the same as the exact distribution. Such parameters are said to be
likelihood equivalent.
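However, the MLE for a Gaussian variance, $\hat{\sigma}^2$, is not an unbiased estimator of $\sigma^2$. In fact, one
can show (Exercise 6.3) that


$$\mathbb{E}\left[\hat{\sigma}^2\right] = \frac{N - 1}{N} \sigma^2 \tag{6.34}$$


However, the following estimator


$$\hat{\sigma}^2_{N-1} = \frac{N}{N - 1} \hat{\sigma}^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2 \tag{6.35}$$


is an unbiased estimator, which we can easily prove as follows:


$$\mathbb{E}\left[\hat{\sigma}^2_{N-1}\right] = \mathbb{E}\left[\frac{N}{N - 1} \hat{\sigma}^2\right] = \frac{N}{N - 1} \cdot \frac{N - 1}{N} \sigma^2 = \sigma^2 \tag{6.36}$$
In Matlab, var(X) returns $\hat{\sigma}^2_{N-1}$ whereas var(X,1) returns $\hat{\sigma}^2$ (the MLE). For large enough
$N$, the difference will be negligible.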
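
The bias in Equation 6.34 is easy to observe by simulation. In the Python standard library, `statistics.pvariance` divides by $N$ (the MLE) and `statistics.variance` divides by $N - 1$, mirroring the two Matlab calls above. The settings below ($N = 5$, $\sigma^2 = 4$, 10,000 repetitions, fixed seed) are assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(0)
n, trials, sigma2 = 5, 10000, 4.0

mle_sum = 0.0
unbiased_sum = 0.0
for _ in range(trials):
    x = [rng.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    mle_sum += statistics.pvariance(x)      # divides by N (the MLE)
    unbiased_sum += statistics.variance(x)  # divides by N - 1

mle_avg = mle_sum / trials
unbiased_avg = unbiased_sum / trials

if __name__ == "__main__":
    print("average MLE variance:     ", mle_avg, "theory:", (n - 1) / n * sigma2)
    print("average unbiased variance:", unbiased_avg, "theory:", sigma2)
```

With such a small $N$ the MLE underestimates the variance by the factor $(N - 1)/N = 0.8$ on average; the gap shrinks as $N$ grows.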

Although the MLE may sometimes be a biased estimator, one can show that asymptotically, it 
is always unbiased. (This is necessary for the MLE to be a consistent estimator.) 

Although being unbiased sounds like a desirable property, this is not always true. See
Section 6.4.4 and (Lindley 1972) for discussion of this point.


Minimum variance estimators 


It seems intuitively reasonable that we want our estimator to be unbiased (although we shall 
give some arguments against this claim below). However, being unbiased is not enough. For 
example, suppose we want to estimate the mean of a Gaussian from $D = \{x_1, \ldots, x_N\}$. The
estimator that just looks at the first data point, $\hat{\theta}(D) = x_1$, is an unbiased estimator, but will
generally be further from $\theta^*$ than the empirical mean $\bar{x}$ (which is also unbiased). So the variance
of an estimator is also important.

A natural question is: how low can the variance go? A famous result, called the
Cramer-Rao lower bound, provides a lower bound on the variance of any unbiased estimator. More
precisely,


Theorem 6.4.1 (Cramer-Rao inequality). Let $X_1, \ldots, X_n \sim p(X|\theta_0)$ and $\hat{\theta} = \hat{\theta}(x_1, \ldots, x_n)$ be
an unbiased estimator of $\theta_0$. Then, under various smoothness assumptions on $p(X|\theta_0)$, we have


$$\mathrm{var}\left[\hat{\theta}\right] \ge \frac{1}{n I(\theta_0)} \tag{6.37}$$


where $I(\theta_0)$ is the Fisher information matrix (see Section 6.2.2).
A proof can be found e.g., in (Rice 1995, p275).
It can be shown that the MLE achieves the Cramer-Rao lower bound, and hence has the
smallest asymptotic variance of any unbiased estimator. Thus MLE is said to be asymptotically
optimal.
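The bias-variance tradeoff


Although using an unbiased estimator seems like a good idea, this is not always the case. To see
why, suppose we use quadratic loss. As we showed above, the corresponding risk is the MSE.
We now derive a very useful decomposition of the MSE. (All expectations and variances are wrt
the true distribution $p(D|\theta^*)$, but we drop the explicit conditioning for notational brevity.) Let
$\hat{\theta} = \hat{\theta}(D)$ denote the estimate, and $\bar{\theta} = \mathbb{E}\left[\hat{\theta}\right]$ denote the expected value of the estimate (as we
vary $D$). Then we have


$$\mathbb{E}\left[(\hat{\theta} - \theta^*)^2\right] = \mathbb{E}\left[\left[(\hat{\theta} - \bar{\theta}) + (\bar{\theta} - \theta^*)\right]^2\right] \tag{6.38}$$
$$= \mathbb{E}\left[(\hat{\theta} - \bar{\theta})^2\right] + 2(\bar{\theta} - \theta^*)\, \mathbb{E}\left[\hat{\theta} - \bar{\theta}\right] + (\bar{\theta} - \theta^*)^2 \tag{6.39}$$
$$= \mathbb{E}\left[(\hat{\theta} - \bar{\theta})^2\right] + (\bar{\theta} - \theta^*)^2 \tag{6.40}$$
$$= \mathrm{var}\left[\hat{\theta}\right] + \mathrm{bias}^2(\hat{\theta}) \tag{6.41}$$
In words,
$$\text{MSE} = \text{variance} + \text{bias}^2 \tag{6.42}$$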


This is called the bias-variance tradeoff (see e.g., (Geman et al. 1992)). What it means is that 
it might be wise to use a biased estimator, so long as it reduces our variance, assuming our goal 
is to minimize squared error. 


Example: estimating a Gaussian mean 


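Let us give an example, based on (Hoff 2009, p79). Suppose we want to estimate the mean of a
Gaussian from $x = (x_1, \ldots, x_N)$. We assume the data is sampled from $x_i \sim \mathcal{N}(\theta^* = 1, \sigma^2)$.
An obvious estimate is the MLE. This has a bias of 0 and a variance of


$$\mathrm{var}\left[\bar{x}|\theta^*\right] = \frac{\sigma^2}{N} \tag{6.43}$$


But we could also use a MAP estimate. In Section 4.6.1, we show that the MAP estimate under
a Gaussian prior of the form $\mathcal{N}(\theta_0, \sigma^2/\kappa_0)$ is given by


$$\tilde{x} \triangleq \frac{N}{N + \kappa_0} \bar{x} + \frac{\kappa_0}{N + \kappa_0} \theta_0 = w \bar{x} + (1 - w) \theta_0 \tag{6.44}$$


where $0 \le w \le 1$ controls how much we trust the MLE compared to our prior. (This is also the
posterior mean, since the mean and mode of a Gaussian are the same.) The bias and variance
are given by

$$\mathbb{E}\left[\tilde{x}\right] - \theta^* = w \theta^* + (1 - w) \theta_0 - \theta^* = (1 - w)(\theta_0 - \theta^*) \tag{6.45}$$
$$\mathrm{var}\left[\tilde{x}\right] = w^2 \frac{\sigma^2}{N} \tag{6.46}$$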
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Figure 6.4 Left: Sampling distribution of the MAP estimate with different prior strengths xo. (The MLE 
corresponds to Ko = 0.) Right: MSE relative to that of the MLE versus sample size. Based on Figure 5.6 of 
(Hoff 2009). Figure generated by samplingDistGaussShrinkage. 


So although the MAP estimate is biased (assuming $w < 1$), it has lower variance.

Let us assume that our prior is slightly misspecified, so we use $\theta_0 = 0$, whereas the truth is
$\theta^* = 1$. In Figure 6.4(a), we see that the sampling distribution of the MAP estimate for $\kappa_0 > 0$
is biased away from the truth, but has lower variance (is narrower) than that of the MLE.

In Figure 6.4(b), we plot $\mathrm{mse}(\tilde{x})/\mathrm{mse}(\bar{x})$ vs $N$. We see that the MAP estimate has lower MSE
than the MLE, especially for small sample size, for $\kappa_0 \in \{1, 2\}$. The case $\kappa_0 = 0$ corresponds to
the MLE, and the case $\kappa_0 = 3$ corresponds to a strong prior, which hurts performance because
the prior mean is wrong. It is clearly important to “tune” the strength of the prior, a topic we
discuss later.
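The relative MSE of the shrinkage estimate can be checked numerically. The following is a minimal sketch, assuming $\sigma^2 = 1$ (a value chosen here for illustration, not taken from the text); the function names are hypothetical. It evaluates the bias and variance formulas analytically and compares them with a Monte Carlo estimate.

```python
import numpy as np

def relative_mse(N, kappa0, theta_true=1.0, theta0=0.0, sigma2=1.0):
    """MSE of the shrinkage (MAP) estimate divided by the MSE of the MLE."""
    w = N / (N + kappa0)                      # weight placed on the MLE
    bias = (1 - w) * (theta0 - theta_true)    # bias of the shrinkage estimate
    var = w**2 * sigma2 / N                   # variance of the shrinkage estimate
    mse_map = var + bias**2                   # MSE = variance + bias^2
    mse_mle = sigma2 / N                      # the MLE is unbiased
    return mse_map / mse_mle

def simulated_mse(N, kappa0, theta_true=1.0, theta0=0.0, sigma2=1.0,
                  n_rep=200_000, seed=0):
    """Monte Carlo estimate of the MSE of the shrinkage estimate."""
    rng = np.random.default_rng(seed)
    # the mean of N Gaussian draws is itself N(theta_true, sigma2 / N)
    xbar = rng.normal(theta_true, np.sqrt(sigma2 / N), size=n_rep)
    w = N / (N + kappa0)
    est = w * xbar + (1 - w) * theta0
    return float(np.mean((est - theta_true) ** 2))

for kappa0 in (0, 1, 2, 3):
    print(kappa0, [round(relative_mse(N, kappa0), 3) for N in (5, 10, 50)])
```

Under these assumptions the ratio is below 1 for $\kappa_0 \in \{1, 2\}$ and above 1 for $\kappa_0 = 3$ once $N > 3$, in line with the discussion of Figure 6.4(b).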


Example: ridge regression 


Another important example of the bias variance tradeoff arises in ridge regression, which we
discuss in Section 7.5. In brief, this corresponds to MAP estimation for linear regression under
a Gaussian prior, $p(w) = N(w|0, \lambda^{-1} I)$. The zero-mean prior encourages the weights to be
small, which reduces overfitting; the precision term, $\lambda$, controls the strength of this prior. Setting
$\lambda = 0$ results in the MLE; using $\lambda > 0$ results in a biased estimate. To illustrate the effect on
the variance, consider a simple example. Figure 6.5 on the left plots each individual fitted curve,
and on the right plots the average fitted curve. We see that as we increase the strength of the
regularizer, the variance decreases, but the bias increases.
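The same effect can be seen without simulation for a fixed design matrix. The sketch below is an illustration under stated assumptions, not the setup of Figure 6.5: it assumes the ridge objective $||y - Xw||_2^2 + \lambda ||w||_2^2$, Gaussian noise with known variance, and a randomly generated `X` and `w_true` that are purely hypothetical. It computes the squared bias and total variance of the ridge weights in closed form.

```python
import numpy as np

def ridge_bias_variance(X, w_true, sigma2, lam):
    """Squared bias and total variance of the ridge estimate for a fixed design X."""
    D = X.shape[1]
    A = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T)   # linear map from y to w_hat
    mean_w = A @ X @ w_true          # E[w_hat], since E[y] = X w_true
    cov_w = sigma2 * A @ A.T         # cov[w_hat], since cov[y] = sigma2 * I
    bias2 = float(np.sum((mean_w - w_true) ** 2))
    variance = float(np.trace(cov_w))
    return bias2, variance

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))         # hypothetical design matrix
w_true = rng.normal(size=5)          # hypothetical true weights
results = [ridge_bias_variance(X, w_true, 1.0, lam) for lam in (0.0, 1.0, 10.0, 100.0)]
for lam, (b2, v) in zip((0.0, 1.0, 10.0, 100.0), results):
    print(f"lambda={lam:6.1f}  bias^2={b2:.4f}  variance={v:.4f}")
```

As $\lambda$ grows, the printed variance shrinks while the squared bias grows, which is the tradeoff described above.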


Bias-variance tradeoff for classification 


If we use 0-1 loss instead of squared error, the above analysis breaks down, since the frequentist
risk is no longer expressible as squared bias plus variance. In fact, one can show (Exercise 7.2
of (Hastie et al. 2009)) that the bias and variance combine multiplicatively. If the estimate is on
the correct side of the decision boundary, then the bias is negative, and decreasing the variance
will decrease the misclassification rate. But if the estimate is on the wrong side of the decision
boundary, then the bias is positive, so it pays to increase the variance (Friedman 1997a). This
little known fact illustrates that the bias-variance tradeoff is not very useful for classification.
It is better to focus on expected loss (see below), not directly on bias and variance. We can
approximate the expected loss using cross validation, as we discuss in Section 6.5.3.

Figure 6.5 Illustration of bias-variance tradeoff for ridge regression. We generate 100 data sets from the
true function, shown in solid green. Left: we plot the regularized fit for 20 different data sets. We use
linear regression with a Gaussian RBF expansion, with 25 centers evenly spread over the [0,1] interval.
Right: we plot the average of the fits, averaged over all 100 datasets. Top row: strongly regularized: we see
that the individual fits are similar to each other (low variance), but the average is far from the truth (high
bias). Bottom row: lightly regularized: we see that the individual fits are quite different from each other
(high variance), but the average is close to the truth (low bias). Based on (Bishop 2006a) Figure 3.5. Figure
generated by biasVarModelComplexity3.


Empirical risk minimization 


Frequentist decision theory suffers from the fundamental problem that one cannot actually
compute the risk function, since it relies on knowing the true data distribution. (By contrast,
the Bayesian posterior expected loss can always be computed, since it conditions on the
data rather than conditioning on $\theta^*$.) However, there is one setting which avoids this problem,
and that is where the task is to predict observable quantities, as opposed to estimating hidden
variables or parameters. That is, instead of looking at loss functions of the form $L(\theta, \delta(D))$,
where $\theta$ is the true but unknown parameter, and $\delta(D)$ is our estimator, let us look at loss
functions of the form $L(y, \delta(x))$, where $y$ is the true but unknown response, and $\delta(x)$ is our
prediction given the input $x$. In this case, the frequentist risk becomes

$$R(p_*, \delta) \triangleq E_{(x,y) \sim p_*}[L(y, \delta(x))] = \sum_x \sum_y L(y, \delta(x)) p_*(x, y) \tag{6.47}$$

where $p_*$ represents “nature's distribution”. Of course, this distribution is unknown, but a simple
approach is to use the empirical distribution, derived from some training data, to approximate
$p_*$, i.e.,

$$p_*(x, y) \approx p_{\mathrm{emp}}(x, y) \triangleq \frac{1}{N} \sum_{i=1}^N \delta_{x_i}(x) \delta_{y_i}(y) \tag{6.48}$$


We then define the empirical risk as follows:

$$R_{\mathrm{emp}}(D, \delta) \triangleq R(p_{\mathrm{emp}}, \delta) = \frac{1}{N} \sum_{i=1}^N L(y_i, \delta(x_i)) \tag{6.49}$$

In the case of 0-1 loss, $L(y, \delta(x)) = I(y \ne \delta(x))$, this becomes the misclassification rate. In
the case of squared error loss, $L(y, \delta(x)) = (y - \delta(x))^2$, this becomes the mean squared error.
We define the task of empirical risk minimization or ERM as finding a decision procedure
(typically a classification rule) to minimize the empirical risk:

$$\delta_{\mathrm{ERM}}(D) = \arg\min_{\delta} R_{\mathrm{emp}}(D, \delta) \tag{6.50}$$
In the unsupervised case, we eliminate all references to $y$, and replace $L(y, \delta(x))$ with
$L(x, \delta(x))$, where, for example, $L(x, \delta(x)) = ||x - \delta(x)||_2^2$, which measures the reconstruction
error. We can define the decision rule using $\delta(x) = \mathrm{decode}(\mathrm{encode}(x))$, as in vector
quantization (Section 11.4.2.6) or PCA (Section 12.2). Finally, we define the empirical risk as

$$R_{\mathrm{emp}}(D, \delta) = \frac{1}{N} \sum_{i=1}^N L(x_i, \delta(x_i)) \tag{6.51}$$

Of course, we can always trivially minimize this risk by setting $\delta(x) = x$, so it is critical that
the encoder-decoder go via some kind of bottleneck.
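To make the definition of the empirical risk concrete, here is a small sketch that averages a chosen loss over a dataset. The helper names and the toy labels are made up for illustration.

```python
import numpy as np

def empirical_risk(loss, y, y_pred):
    """Average loss over the training set."""
    return float(np.mean(loss(np.asarray(y), np.asarray(y_pred))))

def zero_one(y, y_hat):
    return (y != y_hat).astype(float)     # averaging gives the misclassification rate

def squared(y, y_hat):
    return (y - y_hat) ** 2               # averaging gives the mean squared error

y_true = np.array([1, 0, 1, 1])           # made-up labels
y_hat = np.array([1, 1, 1, 0])            # made-up predictions, 2 of 4 wrong
err_rate = empirical_risk(zero_one, y_true, y_hat)
mse = empirical_risk(squared, np.array([1.0, 2.0]), np.array([0.0, 4.0]))
print(err_rate, mse)
```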
Regularized risk minimization 


Note that the empirical risk is equal to the Bayes risk if our prior about “nature’s distribution” is 
that it is exactly equal to the empirical distribution (Minka 2001b): 


$$E[R(p_*, \delta) | p_* = p_{\mathrm{emp}}] = R_{\mathrm{emp}}(D, \delta) \tag{6.52}$$

Therefore minimizing the empirical risk will typically result in overfitting. It is therefore often
necessary to add a complexity penalty to the objective function:

$$R'(D, \delta) = R_{\mathrm{emp}}(D, \delta) + \lambda C(\delta) \tag{6.53}$$

where $C(\delta)$ measures the complexity of the prediction function $\delta(x)$ and $\lambda$ controls the strength
of the complexity penalty. This approach is known as regularized risk minimization (RRM).
Note that if the loss function is negative log likelihood, and the regularizer is a negative log
prior, this is equivalent to MAP estimation.

The two key issues in RRM are: how do we measure complexity, and how do we pick $\lambda$. For
a linear model, we can define the complexity of $\delta$ in terms of its degrees of freedom, discussed in
Section 7.5.3. For more general models, we can use the VC dimension, discussed in Section 6.5.4.
To pick $\lambda$, we can use the methods discussed in Section 6.5.2.


Structural risk minimization 


The regularized risk minimization principle says that we should fit the model, for a given 
complexity penalty, by using 


$$\hat{\delta}_{\lambda} = \arg\min_{\delta} \left[ R_{\mathrm{emp}}(D, \delta) + \lambda C(\delta) \right] \tag{6.54}$$

But how should we pick $\lambda$? We cannot use the training set, since this will underestimate the
true risk, a problem known as optimism of the training error. As an alternative, we can use
the following rule, known as the structural risk minimization principle (Vapnik 1998):

$$\hat{\lambda} = \arg\min_{\lambda} \hat{R}(\hat{\delta}_{\lambda}) \tag{6.55}$$

where $\hat{R}(\delta)$ is an estimate of the risk. There are two widely used estimates: cross validation
and theoretical upper bounds on the risk. We discuss both of these below.


Estimating the risk using cross validation 


We can estimate the risk of some estimator using a validation set. If we don’t have a separate
validation set, we can use cross validation (CV), as we briefly discussed in Section 1.4.8. More
precisely, CV is defined as follows. Let there be $N = |D|$ data cases in the training set. Denote
the data in the $k$’th test fold by $D_k$ and all the other data by $D_{-k}$. (In stratified CV, these folds
are chosen so the class proportions (if discrete labels are present) are roughly equal in each
fold.) Let $F$ be a learning algorithm or fitting function that takes a dataset and a model index
$m$ (this could be a discrete index, such as the degree of a polynomial, or a continuous index, such
as the strength of a regularizer) and returns a parameter vector:

$$\hat{\theta}_m = F(D, m) \tag{6.56}$$

Finally, let $P$ be a prediction function that takes an input and a parameter vector and returns a
prediction:

$$\hat{y} = P(x, \hat{\theta}) = f(x, \hat{\theta}) \tag{6.57}$$
Thus the combined fit-predict cycle is denoted as

$$f_m(x, D) = P(x, F(D, m)) \tag{6.58}$$

The K-fold CV estimate of the risk of $f_m$ is defined by

$$R(m, D, K) \triangleq \frac{1}{N} \sum_{k=1}^K \sum_{i \in D_k} L(y_i, P(x_i, F(D_{-k}, m))) \tag{6.59}$$

Note that we can call the fitting algorithm once per fold. Let $f_m^k(x) = P(x, F(D_{-k}, m))$ be
the function that was trained on all the data except for the test data in fold $k$. Then we can
rewrite the CV estimate as

$$R(m, D, K) = \frac{1}{N} \sum_{k=1}^K \sum_{i \in D_k} L(y_i, f_m^k(x_i)) = \frac{1}{N} \sum_{i=1}^N L(y_i, f_m^{k(i)}(x_i)) \tag{6.60}$$

where $k(i)$ is the fold in which $i$ is used as test data. In other words, we predict $y_i$ using a
model that was trained on data that does not contain $x_i$.
If $K = N$, the method is known as leave one out cross validation or LOOCV. In this case,
the estimated risk becomes

$$R(m, D, N) = \frac{1}{N} \sum_{i=1}^N L(y_i, f_m^{-i}(x_i)) \tag{6.61}$$

where $f_m^{-i}(x) = P(x, F(D_{-i}, m))$. This requires fitting the model $N$ times, where for $f_m^{-i}$ we
omit the $i$’th training case. Fortunately, for some model classes and loss functions (namely linear
models and quadratic loss), we can fit the model once, and analytically “remove” the effect of
the $i$'th training case. This is known as generalized cross validation or GCV.
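The idea of fitting once and analytically removing one training case can be illustrated for ridge regression with squared loss. The sketch below assumes the objective $||y - Xw||_2^2 + \lambda ||w||_2^2$ with no intercept, and uses the standard leave-one-out identity: the held-out residual equals the ordinary residual divided by $1 - H_{ii}$, where $H = X(X^T X + \lambda I)^{-1} X^T$. This is the exact form of the shortcut; GCV as usually defined replaces each $H_{ii}$ by their average. The data and function names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

def loocv_brute_force(X, y, lam):
    """Refit N times, each time leaving out one training case."""
    N = len(y)
    errs = np.empty(N)
    for i in range(N):
        keep = np.arange(N) != i
        w = ridge_fit(X[keep], y[keep], lam)
        errs[i] = (y[i] - X[i] @ w) ** 2
    return float(errs.mean())

def loocv_one_fit(X, y, lam):
    """Fit once, then rescale each residual using the diagonal of the hat matrix."""
    D = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(D), X.T)
    resid = y - H @ y
    return float(np.mean((resid / (1.0 - np.diag(H))) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                                  # made-up inputs
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=40)
print(loocv_brute_force(X, y, 1.0), loocv_one_fit(X, y, 1.0))
```

The two printed numbers agree up to floating point error, but the second needs a single fit instead of $N$.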


Example: using CV to pick A for ridge regression 


As a concrete example, consider picking the strength of the $\ell_2$ regularizer in penalized linear
regression. We use the following rule:

$$\hat{\lambda} = \arg\min_{\lambda \in [\lambda_{\min}, \lambda_{\max}]} R(\lambda, D_{\mathrm{train}}, K) \tag{6.62}$$

where $[\lambda_{\min}, \lambda_{\max}]$ is a finite range of $\lambda$ values that we search over, and $R(\lambda, D_{\mathrm{train}}, K)$ is the
K-fold CV estimate of the risk of using $\lambda$, given by

$$R(\lambda, D_{\mathrm{train}}, K) = \frac{1}{|D_{\mathrm{train}}|} \sum_{k=1}^K \sum_{i \in D_k} L(y_i, f_{\lambda}^k(x_i)) \tag{6.63}$$

where $f_{\lambda}^k(x) = x^T \hat{w}_{\lambda}(D_{-k})$ is the prediction function trained on data excluding fold $k$, and
$\hat{w}_{\lambda}(D) = \arg\min_w \mathrm{NLL}(w, D) + \lambda ||w||_2^2$ is the MAP estimate. Figure 6.6(b) gives an example
of a CV estimate of the risk vs $\log(\lambda)$, where the loss function is squared error.
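The search over $\lambda$ can be sketched in a few lines. This is an illustration under assumptions, not the code behind Figure 6.6: it uses synthetic data, a plain ridge solver for the objective $||y - Xw||_2^2 + \lambda ||w||_2^2$, squared error loss, and a log-spaced grid; all names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

def kfold_cv_risk(X, y, lam, K=5, seed=0):
    """K-fold CV estimate of the squared-error risk for a given lambda."""
    N = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(N), K)
    total = 0.0
    for test_idx in folds:
        train = np.ones(N, dtype=bool)
        train[test_idx] = False
        w = ridge_fit(X[train], y[train], lam)                   # fit on all data except fold k
        total += np.sum((y[test_idx] - X[test_idx] @ w) ** 2)    # loss on fold k
    return total / N

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))                     # made-up inputs
y = X[:, 0] - X[:, 1] + rng.normal(size=50)       # only two features matter
lams = np.logspace(-3, 3, 13)                     # finite grid of lambda values
risks = [kfold_cv_risk(X, y, lam) for lam in lams]
lam_hat = lams[int(np.argmin(risks))]
print(lam_hat, min(risks))
```

Using the same seed for every $\lambda$ keeps the folds fixed, so the risks are compared on identical splits.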

When performing classification, we usually use 0-1 loss. In this case, we optimize a convex
upper bound on the empirical risk to estimate $w$, but we optimize (the CV estimate of) the
risk itself to estimate $\lambda$. We can handle the non-smooth 0-1 loss function when estimating $\lambda$
because we are using brute-force search over the entire (one-dimensional) space.

When we have more than one or two tuning parameters, this approach becomes infeasible. 
In such cases, one can use empirical Bayes, which allows one to optimize large numbers of 
hyper-parameters using gradient-based optimizers instead of brute-force search. See Section 5.6 
for details.

[Figure 6.6 panel titles: (a) mean squared error; (b) 5-fold cross validation, ntrain = 50]

Figure 6.6 (a) Mean squared error for $\ell_2$ penalized degree 14 polynomial regression vs log regularizer.
Same as in Figure 7.8, except now we have $N = 50$ training points instead of 21. The stars correspond
to the values used to plot the functions in Figure 7.7. (b) CV estimate. The vertical scale is truncated for
clarity. The blue line corresponds to the value chosen by the one standard error rule. Figure generated by
linregPolyVsRegDemo.


The one standard error rule 


The above procedure estimates the risk, but does not give any measure of uncertainty. A
standard frequentist measure of uncertainty of an estimate is the standard error of the mean,
defined by

$$\mathrm{se} = \frac{\hat{\sigma}}{\sqrt{N}} = \sqrt{\frac{\hat{\sigma}^2}{N}} \tag{6.64}$$

where $\hat{\sigma}^2$ is an estimate of the variance of the loss:

$$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^N (L_i - \bar{L})^2, \quad L_i = L(y_i, f_m^{k(i)}(x_i)), \quad \bar{L} = \frac{1}{N} \sum_{i=1}^N L_i \tag{6.65}$$

Note that $\sigma$ measures the intrinsic variability of $L_i$ across samples, whereas se measures our
uncertainty about the mean $\bar{L}$.

Suppose we apply CV to a set of models and compute the mean and se of their estimated 
risks. A common heuristic for picking a model from these noisy estimates is to pick the value 
which corresponds to the simplest model whose risk is no more than one standard error above 
the risk of the best model; this is called the one-standard error rule (Hastie et al. 2001, p216). 
For example, in Figure 6.6, we see that this heuristic does not choose the lowest point on the 
curve, but one that is slightly to its right, since that corresponds to a more heavily regularized 
model with essentially the same empirical performance. 
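The rule is easy to state in code. The sketch below assumes we already have a matrix of per-example CV losses, one row per candidate model, with rows ordered from the simplest (most heavily regularized) model to the most complex; the function name and the toy numbers are made up for illustration.

```python
import numpy as np

def one_standard_error_rule(losses):
    """losses[j, i] is the CV loss of example i under model j, simplest model first.

    Returns (index chosen by the rule, index of the model with lowest mean loss).
    """
    losses = np.asarray(losses, dtype=float)
    N = losses.shape[1]
    mean = losses.mean(axis=1)                  # estimated risk of each model
    se = losses.std(axis=1) / np.sqrt(N)        # standard error, using the 1/N variance
    best = int(np.argmin(mean))
    threshold = mean[best] + se[best]
    chosen = int(np.flatnonzero(mean <= threshold)[0])   # simplest model within one se
    return chosen, best

toy_losses = [
    [4.0, 6.0, 5.0, 5.0],       # very simple model, clearly worse
    [1.0, 1.3, 1.15, 1.15],     # slightly simpler than the best, nearly as good
    [0.9, 1.3, 1.1, 1.1],       # lowest mean loss
]
chosen, best = one_standard_error_rule(toy_losses)
print(chosen, best)
```

Here the lowest mean loss belongs to the last row, but the rule picks the middle row, since its mean lies within one standard error of the best.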
CV for model selection in non-probabilistic unsupervised learning


If we are performing unsupervised learning, we must use a loss function such as $L(x, \delta(x)) =
||x - \delta(x)||_2^2$, which measures reconstruction error. Here $\delta(x)$ is some encode-decode scheme.
However, as we discussed in Section 11.5.2, we cannot use CV to determine the complexity of $\delta$,
since we will always get lower loss with a more complex model, even if evaluated on the test set.
This is because more complex models will compress the data less, and induce less distortion.
Consequently, we must either use probabilistic models, or invent other heuristics.


Upper bounding the risk using statistical learning theory * 


The principal problem with cross validation is that it is slow, since we have to fit the model
multiple times. This motivates the desire to compute analytic approximations or bounds to
the generalization error. This is studied in the field of statistical learning theory (SLT).
More precisely, SLT tries to bound the risk $R(p_*, h)$ for any data distribution $p_*$ and hypothesis
$h \in H$ in terms of the empirical risk $R_{\mathrm{emp}}(D, h)$, the sample size $N = |D|$, and the size of the
hypothesis space $H$.

Let us initially consider the case where the hypothesis space is finite, with size $\dim(H) = |H|$.
In other words, we are selecting a model/hypothesis from a finite list, rather than optimizing
real-valued parameters. Then we can prove the following.


Theorem 6.5.1. For any data distribution $p_*$, and any dataset $D$ of size $N$ drawn from $p_*$, the
probability that our estimate of the error rate will be more than $\epsilon$ wrong, in the worst case, is upper
bounded as follows:

$$P\left( \max_{h \in H} |R_{\mathrm{emp}}(D, h) - R(p_*, h)| > \epsilon \right) \le 2 \dim(H) e^{-2N\epsilon^2} \tag{6.66}$$


Proof. To prove this, we need two useful results. First, Hoeffding’s inequality, which states that
if $X_1, \ldots, X_N \sim \mathrm{Ber}(\theta)$, then, for any $\epsilon > 0$,

$$P(|\bar{x} - \theta| > \epsilon) \le 2e^{-2N\epsilon^2} \tag{6.67}$$

where $\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i$. Second, the union bound, which says that if $A_1, \ldots, A_d$ are a set of
events, then $P(\cup_{i=1}^d A_i) \le \sum_{i=1}^d P(A_i)$.

Finally, for notational brevity, let $R(h) = R(h, p_*)$ be the true risk, and $\hat{R}_N(h) = R_{\mathrm{emp}}(D, h)$
be the empirical risk.

Using these results we have

$$P\left( \max_{h \in H} |\hat{R}_N(h) - R(h)| > \epsilon \right) = P\left( \bigcup_{h \in H} |\hat{R}_N(h) - R(h)| > \epsilon \right) \tag{6.68}$$
$$\le \sum_{h \in H} P\left( |\hat{R}_N(h) - R(h)| > \epsilon \right) \tag{6.69}$$
$$\le \sum_{h \in H} 2e^{-2N\epsilon^2} = 2 \dim(H) e^{-2N\epsilon^2} \tag{6.70}$$

This bound tells us that the optimism of the training error increases with $\dim(H)$ but decreases
with $N = |D|$, as is to be expected.
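To get a feel for the numbers, the bound can be evaluated directly, or solved for the sample size that makes it smaller than a target probability $\delta$, which gives $N \ge \log(2|H|/\delta) / (2\epsilon^2)$. The sketch below is a plain transcription of the bound; the function names and the example values are hypothetical.

```python
import math

def finite_class_bound(H_size, N, eps):
    """Upper bound 2 |H| exp(-2 N eps^2) on the worst-case deviation probability."""
    return 2.0 * H_size * math.exp(-2.0 * N * eps**2)

def samples_needed(H_size, eps, delta):
    """Smallest N for which the bound is at most delta."""
    return math.ceil(math.log(2.0 * H_size / delta) / (2.0 * eps**2))

n_req = samples_needed(H_size=1000, eps=0.05, delta=0.05)
print(n_req, finite_class_bound(1000, n_req, 0.05))
print(finite_class_bound(1000, 100, 0.05))   # exceeds 1, so the bound says nothing here
```

Note that the required $N$ grows only logarithmically with the number of hypotheses, but quadratically in $1/\epsilon$.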

If the hypothesis space $H$ is infinite (e.g., we have real-valued parameters), we cannot use
$\dim(H) = |H|$. Instead, we can use a quantity called the Vapnik-Chervonenkis or VC dimension
of the hypothesis class. See (Vapnik 1998) for details.

Stepping back from all the theory, the key intuition behind statistical learning theory is quite 
simple. Suppose we find a model with low empirical risk. If the hypothesis space H is very 
big, relative to the data size, then it is quite likely that we just got “lucky” and were given a 
data set that is well-modeled by our chosen function by chance. However, this does not mean 
that such a function will have low generalization error. But if the hypothesis class is sufficiently 
constrained in size, and/or the training set is sufficiently large, then we are unlikely to get lucky 
in this way, so a low empirical risk is evidence of a low true risk. 

Note that optimism of the training error does not necessarily increase with model complexity, 
but it does increase with the number of different models that are being searched over. 

The advantage of statistical learning theory compared to CV is that the bounds on the risk 
are quicker to compute than using CV. The disadvantage is that it is hard to compute the VC 
dimension for many interesting models, and the upper bounds are usually very loose (although 
see (Kaariainen and Langford 2005)). 

One can extend statistical learning theory by taking computational complexity of the learner 
into account. This field is called computational learning theory or COLT. Most of this work 
focuses on the case where h is a binary classifier, and the loss function is 0-1 loss. If we observe 
a low empirical risk, and the hypothesis space is suitably “small”, then we can say that our 
estimated function is probably approximately correct or PAC. A hypothesis space is said to be 
efficiently PAC-learnable if there is a polynomial time algorithm that can identify a function 
that is PAC. See (Kearns and Vazirani 1994) for details. 


Surrogate loss functions 


Minimizing the loss in the ERM/RRM framework is not always easy. For example, we might
want to optimize the AUC or F1 scores. Or more simply, we might just want to minimize the 0-1
loss, as is common in classification. Unfortunately, the 0-1 risk is a very non-smooth objective
and hence is hard to optimize. One alternative is to use maximum likelihood estimation instead,
since the negative log-likelihood is a smooth convex upper bound on the 0-1 risk, as we show below.

To see this, consider binary logistic regression, and let $y_i \in \{-1, +1\}$. Suppose our decision
function computes the log-odds ratio,

$$f(x_i) = \log \frac{p(y = 1|x_i, w)}{p(y = -1|x_i, w)} = w^T x_i = \eta_i \tag{6.71}$$

Then the corresponding probability distribution on the output label is

$$p(y_i|x_i, w) = \mathrm{sigm}(y_i \eta_i) \tag{6.72}$$

Let us define the log-loss as

$$L_{\mathrm{nll}}(y, \eta) = -\log p(y|x, w) = \log(1 + e^{-y\eta}) \tag{6.73}$$

Figure 6.7 Illustration of various loss functions for binary classification. The horizontal axis is the margin
$y\eta$, the vertical axis is the loss. The log loss uses log base 2. Figure generated by hingeLossPlot.


It is clear that minimizing the average log-loss is equivalent to maximizing the likelihood. 


Now consider computing the most probable label, which is equivalent to using $\hat{y} = -1$ if
$\eta_i < 0$ and $\hat{y} = +1$ if $\eta_i \ge 0$. The 0-1 loss of our function becomes

$$L_{01}(y, \eta) = I(y \ne \hat{y}) = I(y\eta < 0) \tag{6.74}$$

Figure 6.7 plots these two loss functions. We see that the NLL is indeed an upper bound on the
0-1 loss.

Log-loss is an example of a surrogate loss function. Another example is the hinge loss:

$$L_{\mathrm{hinge}}(y, \eta) = \max(0, 1 - y\eta) \tag{6.75}$$

See Figure 6.7 for a plot. We see that the function looks like a door hinge, hence its name.
This loss function forms the basis of a popular classification method known as support vector
machines (SVM), which we will discuss in Section 14.5.

The surrogate is usually chosen to be a convex upper bound, since convex functions are easy 
to minimize. See e.g., (Bartlett et al. 2006) for more information. 
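The upper-bound property is simple to verify numerically on a grid of margins $y\eta$. The sketch below uses log base 2 for the log-loss, as in Figure 6.7; the grid and function names are choices made here for illustration.

```python
import numpy as np

def zero_one_loss(margin):
    return (margin < 0).astype(float)             # 1 when y * eta is negative

def log_loss_base2(margin):
    return np.log2(1.0 + np.exp(-margin))         # log-loss rescaled to log base 2

def hinge_loss(margin):
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-3.0, 3.0, 601)             # grid of values of y * eta
log_is_bound = bool(np.all(log_loss_base2(margins) >= zero_one_loss(margins)))
hinge_is_bound = bool(np.all(hinge_loss(margins) >= zero_one_loss(margins)))
print(log_is_bound, hinge_is_bound)
```

The choice of base matters for this check: with the natural logarithm the log-loss at zero margin is $\log 2 \approx 0.69$, so it would dip below the 0-1 loss for slightly negative margins.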


6.6 Pathologies of frequentist statistics * 


I believe that it would be very difficult to persuade an intelligent person that current 
[frequentist] statistical practice was sensible, but that there would be much less difficulty 
with an approach via likelihood and Bayes’ theorem. — George Box, 1962. 


Frequentist statistics exhibits various forms of weird and undesirable behaviors, known as 
pathologies. We give a few examples below, in order to caution the reader; these and other 
examples are explained in more detail in (Lindley 1972; Lindley and Phillips 1976; Lindley 1982; 
Berger 1985; Jaynes 2003; Minka 1999).


Counter-intuitive behavior of confidence intervals


A confidence interval is an interval derived from the sampling distribution of an estimator
(whereas a Bayesian credible interval is derived from the posterior of a parameter, as we discussed
in Section 5.2.2). More precisely, a frequentist confidence interval for some parameter $\theta$
is defined by the following (rather un-natural) expression:

$$C'_{\alpha}(\theta) = (\ell, u) : P(\ell(\tilde{D}) \le \theta \le u(\tilde{D}) | \tilde{D} \sim \theta) = 1 - \alpha \tag{6.76}$$

That is, if we sample hypothetical future data $\tilde{D}$ from $\theta$, then $(\ell(\tilde{D}), u(\tilde{D}))$ is a confidence
interval if the parameter $\theta$ lies inside this interval $100(1 - \alpha)$ percent of the time.

Let us step back for a moment and think about what is going on. In Bayesian statistics,
we condition on what is known (namely the observed data, $D$) and average over what
is not known, namely the parameter $\theta$. In frequentist statistics, we do exactly the opposite:
we condition on what is unknown (namely the true parameter value $\theta$) and average over
hypothetical future data sets $\tilde{D}$.

This counter-intuitive definition of confidence intervals can lead to bizarre results. Consider
the following example from (Berger 1985, p11). Suppose we draw two integers $D = (x_1, x_2)$ from

$$p(x|\theta) = \begin{cases} 0.5 & \text{if } x = \theta \\ 0.5 & \text{if } x = \theta + 1 \\ 0 & \text{otherwise} \end{cases} \tag{6.77}$$

If $\theta = 39$, we would expect the following outcomes each with probability 0.25:

$$(39, 39), (39, 40), (40, 39), (40, 40) \tag{6.78}$$

Let $m = \min(x_1, x_2)$ and define the following confidence interval:

$$[\ell(D), u(D)] = [m, m] \tag{6.79}$$

For the above samples this yields

$$[39, 39], [39, 39], [39, 39], [40, 40] \tag{6.80}$$

Hence Equation 6.79 is clearly a 75% CI, since 39 is contained in 3/4 of these intervals. However,
if $D = (39, 40)$ then $p(\theta = 39|D) = 1.0$, so we know that $\theta$ must be 39, yet we only have 75%
“confidence” in this fact.
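This example is small enough to enumerate. The sketch below lists the four equally likely samples for $\theta = 39$, computes the coverage of the interval $[m, m]$, and also lists which values of $\theta$ are compatible with a given observed pair; the helper name `posterior_support` is made up for illustration.

```python
from itertools import product

theta = 39
outcomes = list(product([theta, theta + 1], repeat=2))     # each has probability 0.25
intervals = [(min(x1, x2), min(x1, x2)) for x1, x2 in outcomes]
coverage = sum(lo <= theta <= hi for lo, hi in intervals) / len(outcomes)
print(intervals, coverage)

def posterior_support(x1, x2):
    """Values of theta under which the observed pair has nonzero likelihood."""
    # each observation x equals theta or theta + 1, so theta is x or x - 1
    return sorted({x1, x1 - 1} & {x2, x2 - 1})

print(posterior_support(39, 40))   # only one value of theta remains
print(posterior_support(39, 39))   # two values of theta remain
```

The interval has the same 75% coverage regardless of which pair was observed, even though some pairs pin down $\theta$ exactly and others do not.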

Another, less contrived example, is as follows. Suppose we want to estimate the parameter $\theta$
of a Bernoulli distribution. Let $\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i$ be the sample mean. The MLE is $\hat{\theta} = \bar{x}$. An
approximate 95% confidence interval for a Bernoulli parameter is $\bar{x} \pm 1.96 \sqrt{\bar{x}(1 - \bar{x})/N}$ (this is
called a Wald interval and is based on a Gaussian approximation to the Binomial distribution;
compare to Equation 3.27). Now consider a single trial, where $N = 1$ and $x_1 = 0$. The MLE
is 0, which overfits, as we saw in Section 3.3.4.1. But our 95% confidence interval is also (0,0),
which seems even worse. It can be argued that the above flaw is because we approximated
the true sampling distribution with a Gaussian, or because the sample size was too small, or the
parameter “too extreme”. However, the Wald interval can behave badly even for large $N$, and
non-extreme parameters (Brown et al. 2001).
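The degenerate interval is easy to reproduce. The following sketch computes the Wald interval from a list of 0/1 outcomes; the function name and the second dataset are made up for illustration.

```python
import math

def wald_interval(x, z=1.96):
    """Approximate 95% Wald interval for a Bernoulli parameter from 0/1 data x."""
    N = len(x)
    xbar = sum(x) / N
    half_width = z * math.sqrt(xbar * (1.0 - xbar) / N)
    return xbar - half_width, xbar + half_width

single_trial = wald_interval([0])              # one trial with x_1 = 0
ten_trials = wald_interval([0] * 7 + [1] * 3)  # made-up data with 3 successes in 10
print(single_trial, ten_trials)
```

With a single failure the interval collapses to a point at 0, since the plug-in variance estimate $\bar{x}(1 - \bar{x})$ is itself 0.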
p-values considered harmful


Suppose we want to decide whether to accept or reject some baseline model, which we will
call the null hypothesis. We need to define some decision rule. In frequentist statistics, it
is standard to first compute a quantity called the p-value, which is defined as the probability
(under the null) of observing some test statistic $f(D)$ (such as the chi-squared statistic) that is
as large or larger than that actually observed:⁵

$$\mathrm{pvalue}(D) \triangleq P(f(\tilde{D}) \ge f(D) | \tilde{D} \sim H_0) \tag{6.81}$$

This quantity relies on computing a tail area probability of the sampling distribution; we give
an example of how to do this below.

Given the p-value, we define our decision rule as follows: we reject the null hypothesis iff the
p-value is less than some threshold, such as $\alpha = 0.05$. If we do reject it, we say the difference
between the observed test statistic and the expected test statistic is statistically significant at
level $\alpha$. This approach is known as null hypothesis significance testing, or NHST.

This procedure guarantees that our expected type I (false positive) error rate is at most $\alpha$.
This is sometimes interpreted as saying that frequentist hypothesis testing is very conservative,
since it is unlikely to accidentally reject the null hypothesis. But in fact the opposite is the case:
because this method only worries about trying to reject the null, it can never gather evidence
in favor of the null, no matter how large the sample size. Because of this, p-values tend to
overstate the evidence against the null, and are thus very “trigger happy”.

In general there can be huge differences between p-values and the quantity that we really
care about, which is the posterior probability of the null hypothesis given the data, $p(H_0|D)$.
In particular, Sellke et al. (2001) show that even if the p-value is as low as 0.05, the posterior
probability of $H_0$ is at least 30%, and often much higher. So frequentists often claim to have
“significant” evidence of an effect that cannot be explained by the null hypothesis, whereas
Bayesians are usually more conservative in their claims. For example, p-values have been used
to “prove” that ESP (extra-sensory perception) is real (Wagenmakers et al. 2011), even though ESP
is clearly very improbable. For this reason, p-values have been banned from certain medical
journals (Matthews 1998).

Another problem with p-values is that their computation depends on decisions you make
about when to stop collecting data, even if these decisions don't change the data you actually
observed. For example, suppose I toss a coin $n = 12$ times and observe $s = 9$ successes (heads)
and $f = 3$ failures (tails), so $n = s + f$. In this case, $n$ is fixed and $s$ (and hence $f$) is random.
The relevant sampling model is the binomial

$$\mathrm{Bin}(s|n, \theta) = \binom{n}{s} \theta^s (1 - \theta)^{n-s} \tag{6.82}$$

Let the null hypothesis be that the coin is fair, $\theta = 0.5$, where $\theta$ is the probability of success
(heads). The one-sided p-value, using test statistic $t(s) = s$, is

$$p_1 = P(S \ge 9|H_0) = \sum_{s=9}^{12} \mathrm{Bin}(s|12, 0.5) = \sum_{s=9}^{12} \binom{12}{s} 0.5^{12} = 0.073 \tag{6.83}$$


The two-sided p-value is

$$p_2 = \sum_{s=9}^{12} \mathrm{Bin}(s|12, 0.5) + \sum_{s=0}^{3} \mathrm{Bin}(s|12, 0.5) = 0.073 + 0.073 = 0.146 \tag{6.84}$$

In either case, the p-value is larger than the magical 5% threshold, so a frequentist would not
reject the null hypothesis.

5. The reason we cannot just compute the probability of the observed value of the test statistic is that this will have
probability zero under a pdf. The p-value is defined in terms of the cdf, so is always a number between 0 and 1.

Now suppose I told you that I actually kept tossing the coin until I observed $f = 3$ tails. In
this case, $f$ is fixed and $n$ (and hence $s = n - f$) is random. The probability model becomes
the negative binomial distribution, given by

$$\mathrm{NegBinom}(s|f, \theta) = \binom{s + f - 1}{f - 1} \theta^s (1 - \theta)^f \tag{6.85}$$

where $f = n - s$.

Note that the term which depends on $\theta$ is the same in Equations 6.82 and 6.85, so the
posterior over $\theta$ would be the same in both cases. However, these two interpretations of the
same data give different p-values. In particular, under the negative binomial model we get

$$p_3 = P(S \ge 9|H_0) = \sum_{s=9}^{\infty} \binom{3 + s - 1}{2} (1/2)^s (1/2)^3 = 0.0327 \tag{6.86}$$


So the p-value is 3%, and suddenly there seems to be significant evidence of bias in the coin! 
Obviously this is ridiculous: the data is the same, so our inferences about the coin should be 
the same. After all, I could have chosen the experimental protocol at random. It is the outcome 
of the experiment that matters, not the details of how I decided which one to run. 
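Both p-values can be reproduced with a few lines of arithmetic. The sketch below sums the binomial tail directly, and obtains the negative binomial tail as one minus the finite sum over $s = 0, \ldots, 8$, which avoids the infinite sum; the function names are made up for illustration.

```python
from math import comb

def p_binomial(s_obs=9, n=12, theta=0.5):
    """P(S >= s_obs) when the number of tosses n is fixed in advance."""
    return sum(comb(n, s) * theta**s * (1 - theta)**(n - s)
               for s in range(s_obs, n + 1))

def p_negative_binomial(s_obs=9, f=3, theta=0.5):
    """P(S >= s_obs) when tossing continues until f tails have been seen."""
    # sum the complementary finite range s = 0, ..., s_obs - 1
    below = sum(comb(s + f - 1, f - 1) * theta**s * (1 - theta)**f
                for s in range(s_obs))
    return 1.0 - below

p1 = p_binomial()
p3 = p_negative_binomial()
print(round(p1, 4), round(p3, 4))
```

The observed data (9 heads, 3 tails) are identical in both calls; only the assumed stopping rule differs, and that alone moves the p-value across the 5% threshold.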

Although this might seem like just a mathematical curiosity, this also has significant practical
implications. In particular, the fact that the stopping rule affects the computation of the
p-value means that frequentists often do not terminate experiments early, even when it is obvious
what the conclusions are, lest it adversely affect their statistical analysis. If the experiments are
costly or harmful to people, this is obviously a bad idea. Perhaps it is not surprising, then, that
the US Food and Drug Administration (FDA), which regulates clinical trials of new drugs, has
recently become supportive of Bayesian methods⁶, since Bayesian methods are not affected by
the stopping rule.


The likelihood principle 


The fundamental reason for many of these pathologies is that frequentist inference violates 
the likelihood principle, which says that inference should be based on the likelihood of the 
observed data, not based on hypothetical future data that you have not observed. Bayes obviously 
satisfies the likelihood principle, and consequently does not suffer from these pathologies. 

A compelling argument in favor of the likelihood principle was presented in (Birnbaum 1962), 
who showed that it followed automatically from two simpler principles. The first of these is the 
sufficiency principle, which says that a sufficient statistic contains all the relevant information
about an unknown parameter (arguably this is true by definition). The second principle is
known as weak conditionality, which says that inferences should be based on the events that
happened, not which might have happened. To motivate this, consider an example from (Berger
1985). Suppose we need to analyse a substance, and can send it either to a laboratory in New
York or in California. The two labs seem equally good, so a fair coin is used to decide between
them. The coin comes up heads, so the California lab is chosen. When the results come back,
should it be taken into account that the coin could have come up tails and thus the New York
lab could have been used? Most people would argue that the New York lab is irrelevant, since
the tails event didn't happen. This is an example of weak conditionality. Given this principle,
one can show that all inferences should only be based on what was observed, which is in
contrast to standard frequentist procedures. See (Berger and Wolpert 1988) for further details on
the likelihood principle.

6. See http://yamlb.wordpress.com/2006/06/19/the-us-fda-is-becoming-progressively-more-bayesian/.


Why isn’t everyone a Bayesian? 


Given these fundamental flaws of frequentist statistics, and the fact that Bayesian methods 
do not have such flaws, an obvious question to ask is: “Why isn’t everyone a Bayesian?” The 
(frequentist) statistician Bradley Efron wrote a paper with exactly this title (Efron 1986). His short 
paper is well worth reading for anyone interested in this topic. Below we quote his opening 
section: 


The title is a reasonable question to ask on at least two counts. First of all, everyone used
to be a Bayesian. Laplace wholeheartedly endorsed Bayes’s formulation of the inference
problem, and most 19th-century scientists followed suit. This included Gauss, whose
statistical work is usually presented in frequentist terms.


A second and more important point is the cogency of the Bayesian argument. Modern
statisticians, following the lead of Savage and de Finetti, have advanced powerful theoretical
arguments for preferring Bayesian inference. A byproduct of this work is a disturbing
catalogue of inconsistencies in the frequentist point of view.


Nevertheless, everyone is not a Bayesian. The current era (1986) is the first century in 
which statistics has been widely used for scientific reporting, and in fact, 20th-century 
statistics is mainly non-Bayesian. However, Lindley (1975) predicts a change for the 21st 
century. 


Time will tell whether Lindley was right... 


Exercises 


Exercise 6.1 Pessimism of LOOCV 


(Source: Witten05, p152.) Suppose we have a completely random labeled dataset (i.e., the features $\mathbf{x}$ tell us 
nothing about the class labels $y$) with $N_1$ examples of class 1, and $N_2$ examples of class 2, where $N_1 = N_2$. 
What is the best misclassification rate any method can achieve? What is the estimated misclassification 
rate of the same method using LOOCV? 


Exercise 6.2 James Stein estimator for Gaussian means 


Consider the 2 stage model $Y_i|\theta_i \sim \mathcal{N}(\theta_i, \sigma^2)$ and $\theta_i|\mu \sim \mathcal{N}(m_0, \tau_0^2)$. Suppose $\sigma^2 = 500$ is known and 
we observe the following 6 data points, $i = 1:6$: 
1505, 1528, 1564, 1498, 1600, 1470 


a. Find the ML-II estimates of $m_0$ and $\tau_0^2$. 


b. Find the posterior estimates $\mathbb{E}[\theta_i|y_i, m_0, \tau_0]$ and $\mathrm{var}[\theta_i|y_i, m_0, \tau_0]$ for $i = 1$. (The other terms, 
$i = 2:6$, are computed similarly.) 


c. Give a 95% credible interval for $p(\theta_i|y_i, m_0, \tau_0)$ for $i = 1$. Do you trust this interval (assuming the 
Gaussian assumption is reasonable)? i.e. is it likely to be too large or too small, or just right? 


d. What do you expect would happen to your estimates if $\sigma^2$ were much smaller (say $\sigma^2 = 1$)? You do 
not need to compute the numerical answer; just briefly explain what would happen qualitatively, and 
why. 


Exercise 6.3 $\hat{\sigma}^2_{MLE}$ is biased 


Show that $\hat{\sigma}^2_{MLE} = \frac{1}{N}\sum_{n=1}^N (x_n - \hat{\mu})^2$ is a biased estimator of $\sigma^2$, i.e., show 


$\mathbb{E}_{X_1,\ldots,X_N \sim \mathcal{N}(\mu,\sigma)}[\hat{\sigma}^2(X_1,\ldots,X_N)] \neq \sigma^2$ 


Hint: note that $X_1, \ldots, X_N$ are independent, and use the fact that the expectation of a product of 
independent random variables is the product of the expectations. 


Exercise 6.4 Estimation of $\sigma^2$ when $\mu$ is known 


Suppose we sample $x_1, \ldots, x_N \sim \mathcal{N}(\mu, \sigma^2)$ where $\mu$ is a known constant. Derive an expression for the 
MLE for $\sigma^2$ in this case. Is it unbiased? 


Linear regression 


Introduction 


Linear regression is the “work horse” of statistics and (supervised) machine learning. When 
augmented with kernels or other forms of basis function expansion, it can also model 
non-linear relationships. And when the Gaussian output is replaced with a Bernoulli or multinoulli 
distribution, it can be used for classification, as we will see below. So it pays to study this model 
in detail. 


Model specification 


As we discussed in Section 1.4.5, linear regression is a model of the form 
$p(y|\mathbf{x}, \boldsymbol{\theta}) = \mathcal{N}(y|\mathbf{w}^T\mathbf{x}, \sigma^2)$ (7.1) 


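Linear regression can be made to model non-linear relationships by replacing $\mathbf{x}$ with some 
non-linear function of the inputs, $\boldsymbol{\phi}(\mathbf{x})$. That is, we use 


$p(y|\mathbf{x}, \boldsymbol{\theta}) = \mathcal{N}(y|\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}), \sigma^2)$ (7.2) 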


This is known as basis function expansion. (Note that the model is still linear in the parameters 
$\mathbf{w}$, so it is still called linear regression; the importance of this will become clear below.) A simple 
example is the set of polynomial basis functions, where the model has the form 


$\boldsymbol{\phi}(x) = [1, x, x^2, \ldots, x^d]$ (7.3) 


Figure 1.18 illustrates the effect of changing d: increasing the degree d allows us to create 
increasingly complex functions. 
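
The polynomial expansion of Equation 7.3 can be sketched in a few lines of Python. This is only an illustration: the function names `poly_features` and `predict` and the weight values are hypothetical and do not come from the book's code.

```python
def poly_features(x, d):
    """Map a scalar input x to the polynomial basis [1, x, x^2, ..., x^d]."""
    return [x ** k for k in range(d + 1)]


def predict(w, x):
    """Evaluate w^T phi(x); the polynomial degree is implied by len(w)."""
    phi = poly_features(x, len(w) - 1)
    return sum(w_k * phi_k for w_k, phi_k in zip(w, phi))


# Hypothetical weights for the quadratic 1 - 2x + 0.5x^2.
w = [1.0, -2.0, 0.5]
print(poly_features(2.0, 3))  # [1.0, 2.0, 4.0, 8.0]
print(predict(w, 2.0))        # 1 - 4 + 2 = -1.0
```

The prediction is a non-linear function of x but a linear function of w, which is why the model still counts as linear regression.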

We can also apply linear regression to more than 1 input. For example, consider modeling 
temperature as a function of location. Figure 7.1(a) plots $\mathbb{E}[y|\mathbf{x}] = w_0 + w_1 x_1 + w_2 x_2$, and 
Figure 7.1(b) plots $\mathbb{E}[y|\mathbf{x}] = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2$. 


Maximum likelihood estimation (least squares) 


A common way to estimate the parameters of a statistical model is to compute the MLE, which 
is defined as 


$\hat{\boldsymbol{\theta}} \triangleq \arg\max_{\boldsymbol{\theta}} \log p(\mathcal{D}|\boldsymbol{\theta})$ (7.4) 
Figure 7.1 Linear regression applied to 2d data. Vertical axis is temperature, horizontal axes are location 
within a room. Data was collected by some remote sensing motes at Intel’s lab in Berkeley, CA (data 
courtesy of Romain Thibaux). (a) The fitted plane has the form $\hat{f}(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2$. (b) 
Temperature data is fitted with a quadratic of the form $\hat{f}(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2$. 
Produced by surfaceFitDemo. 


It is common to assume the training examples are independent and identically distributed, 
commonly abbreviated to iid. This means we can write the log-likelihood as follows: 


$\ell(\boldsymbol{\theta}) \triangleq \log p(\mathcal{D}|\boldsymbol{\theta}) = \sum_{i=1}^N \log p(y_i|\mathbf{x}_i, \boldsymbol{\theta})$ (7.5) 


Instead of maximizing the log-likelihood, we can equivalently minimize the negative log likelihood 
or NLL: 


$\mathrm{NLL}(\boldsymbol{\theta}) \triangleq -\sum_{i=1}^N \log p(y_i|\mathbf{x}_i, \boldsymbol{\theta})$ (7.6) 


The NLL formulation is sometimes more convenient, since many optimization software packages 
are designed to find the minima of functions, rather than maxima. 

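Now let us apply the method of MLE to the linear regression setting. Inserting the definition 
of the Gaussian into the above, we find that the log likelihood is given by 


$\ell(\boldsymbol{\theta}) = \sum_{i=1}^N \log\left[\left(\frac{1}{2\pi\sigma^2}\right)^{\frac{1}{2}} \exp\left(-\frac{1}{2\sigma^2}(y_i - \mathbf{w}^T\mathbf{x}_i)^2\right)\right]$ (7.7) 
$= -\frac{1}{2\sigma^2}\mathrm{RSS}(\mathbf{w}) - \frac{N}{2}\log(2\pi\sigma^2)$ (7.8) 

RSS stands for residual sum of squares and is defined by 

$\mathrm{RSS}(\mathbf{w}) \triangleq \sum_{i=1}^N (y_i - \mathbf{w}^T\mathbf{x}_i)^2$ (7.9) 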


The RSS is also called the sum of squared errors, or SSE, and SSE/N is called the mean 
squared error or MSE. It can also be written as the square of the $\ell_2$ norm of the vector of 
[Figure 7.2 plots omitted. Legend: prediction, truth. Panel (b) title: Sum of squares error contours for linear regression.] 

Figure 7.2 (a) In linear least squares, we try to minimize the sum of squared distances from each training 
point (denoted by a red circle) to its approximation (denoted by a blue cross), that is, we minimize the 
sum of the squared lengths of the little vertical blue lines. The red diagonal line represents $\hat{y}(x) = w_0 + w_1 x$, 
which is the least squares regression line. Note that these residual lines are not perpendicular to the least 
squares line, in contrast to Figure 12.5. Figure generated by residualsDemo. (b) Contours of the RSS error 
surface for the same example. The red cross represents the MLE, $\mathbf{w} = (1.45, 0.93)$. Figure generated by 
contoursSSEdemo. 


residual errors: 

$\mathrm{RSS}(\mathbf{w}) = \|\boldsymbol{\epsilon}\|_2^2 = \sum_{i=1}^N \epsilon_i^2$ (7.10) 


where $\epsilon_i = (y_i - \mathbf{w}^T\mathbf{x}_i)$. 

We see that the MLE for $\mathbf{w}$ is the one that minimizes the RSS, so this method is known 
as least squares. This method is illustrated in Figure 7.2(a). The training data $(x_i, y_i)$ are 
shown as red circles, the estimated values $(x_i, \hat{y}_i)$ are shown as blue crosses, and the residuals 
$\epsilon_i = y_i - \hat{y}_i$ are shown as vertical blue lines. The goal is to find the setting of the parameters 
(the slope $w_1$ and intercept $w_0$) such that the resulting red line minimizes the sum of squared 
residuals (the squared lengths of the vertical blue lines). 

In Figure 7.2(b), we plot the NLL surface for our linear regression example. We see that it is a 
quadratic “bowl” with a unique minimum, which we now derive. (Importantly, this is true even 
if we use basis function expansion, such as polynomials, because the model is still linear in the 
parameters $\mathbf{w}$, and hence the NLL is still quadratic in $\mathbf{w}$, even if the model is not linear in the inputs $\mathbf{x}$.) 


Derivation of the MLE 


First, we rewrite the objective in a form that is more amenable to differentiation: 


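$\mathrm{NLL}(\mathbf{w}) = \frac{1}{2}(\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w}) = \frac{1}{2}\mathbf{w}^T(\mathbf{X}^T\mathbf{X})\mathbf{w} - \mathbf{w}^T(\mathbf{X}^T\mathbf{y})$ (7.11) 


(Here we have dropped the factor $1/\sigma^2$ and the additive terms that do not depend on $\mathbf{w}$, including 
$\frac{1}{2}\mathbf{y}^T\mathbf{y}$ in the second expression.) In this expression, 


$\mathbf{X}^T\mathbf{X} = \sum_{i=1}^N \mathbf{x}_i\mathbf{x}_i^T = \sum_{i=1}^N \begin{pmatrix} x_{i,1}^2 & \cdots & x_{i,1}x_{i,D} \\ \vdots & \ddots & \vdots \\ x_{i,D}x_{i,1} & \cdots & x_{i,D}^2 \end{pmatrix}$ (7.12) 


is the sum of squares matrix and 

$\mathbf{X}^T\mathbf{y} = \sum_{i=1}^N \mathbf{x}_i y_i.$ (7.13) 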
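Using results from Equation 4.10, we see that the gradient of this is given by 

$\mathbf{g}(\mathbf{w}) = [\mathbf{X}^T\mathbf{X}\mathbf{w} - \mathbf{X}^T\mathbf{y}] = \sum_{i=1}^N \mathbf{x}_i(\mathbf{w}^T\mathbf{x}_i - y_i)$ (7.14) 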


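Equating to zero we get 

$\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$ (7.15) 


This is known as the normal equation. The corresponding solution $\hat{\mathbf{w}}$ to this linear system of 
equations is called the ordinary least squares or OLS solution, which is given by 


$\hat{\mathbf{w}}_{OLS} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ (7.16) 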
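
A minimal sketch of solving the normal equations, assuming a small dense problem and using only the Python standard library. It forms X^T X and X^T y explicitly and solves the resulting D x D linear system by Gaussian elimination rather than by computing an explicit inverse. The helper names and the noise-free data are hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is (numerically) singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def ols(X, y):
    """Form X^T X and X^T y, then solve the normal equations (7.15)."""
    d = len(X[0])
    XtX = [[sum(row[j] * row[k] for row in X) for k in range(d)]
           for j in range(d)]
    Xty = [sum(row[j] * y_i for row, y_i in zip(X, y)) for j in range(d)]
    return solve_linear_system(XtX, Xty)


# Hypothetical noise-free data from y = 1 + 2x; first column is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w_hat = ols(X, y)
print(w_hat)  # approximately [1.0, 2.0]
```

Forming X^T X squares the condition number of the problem, so this direct approach is mainly useful for small, well-conditioned examples.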


Geometric interpretation 


This equation has an elegant geometrical interpretation, as we now explain. We assume $N > D$, 
so we have more examples than features. The columns of $\mathbf{X}$ define a linear subspace of 
dimensionality $D$ which is embedded in $N$ dimensions. Let the $j$'th column be $\tilde{\mathbf{x}}_j$, which is 
a vector in $\mathbb{R}^N$. (This should not be confused with $\mathbf{x}_i \in \mathbb{R}^D$, which represents the $i$'th data 
case.) Similarly, $\mathbf{y}$ is a vector in $\mathbb{R}^N$. For example, suppose we have $N = 3$ examples in $D = 2$ 
dimensions: 


$\mathbf{X} = \begin{pmatrix} 1 & 2 \\ 1 & -2 \\ 1 & 2 \end{pmatrix}, \quad \mathbf{y} = \begin{pmatrix} 8.8957 \\ 0.6130 \\ 1.7761 \end{pmatrix}$ (7.17) 


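These vectors are illustrated in Figure 7.3. 
We seek a vector $\hat{\mathbf{y}} \in \mathbb{R}^N$ that lies in this linear subspace and is as close as possible to $\mathbf{y}$, 
i.e., we want to find 


$\arg\min_{\hat{\mathbf{y}} \in \mathrm{span}(\{\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_D\})} \|\mathbf{y} - \hat{\mathbf{y}}\|_2.$ (7.18) 


Since $\hat{\mathbf{y}} \in \mathrm{span}(\mathbf{X})$, there exists some weight vector $\mathbf{w}$ such that 


$\hat{\mathbf{y}} = w_1\tilde{\mathbf{x}}_1 + \cdots + w_D\tilde{\mathbf{x}}_D = \mathbf{X}\mathbf{w}$ (7.19) 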
Figure 7.3 Graphical interpretation of least squares for $N = 3$ examples and $D = 2$ features. $\tilde{\mathbf{x}}_1$ and 
$\tilde{\mathbf{x}}_2$ are vectors in $\mathbb{R}^3$; together they define a 2D plane. $\mathbf{y}$ is also a vector in $\mathbb{R}^3$ but does not lie on this 
2D plane. The orthogonal projection of $\mathbf{y}$ onto this plane is denoted $\hat{\mathbf{y}}$. The red line from $\mathbf{y}$ to $\hat{\mathbf{y}}$ is 
the residual, whose norm we want to minimize. For visual clarity, all vectors have been converted to unit 
norm. Figure generated by leastSquaresProjection. 


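To minimize the norm of the residual, $\mathbf{y} - \hat{\mathbf{y}}$, we want the residual vector to be orthogonal to 
every column of $\mathbf{X}$, so $\tilde{\mathbf{x}}_j^T(\mathbf{y} - \hat{\mathbf{y}}) = 0$ for $j = 1:D$. Hence 


$\tilde{\mathbf{x}}_j^T(\mathbf{y} - \hat{\mathbf{y}}) = 0 \Rightarrow \mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) = \mathbf{0} \Rightarrow \mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ (7.20) 

Hence our projected value of $\mathbf{y}$ is given by 

$\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ (7.21) 


This corresponds to an orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$. The projection 
matrix $\mathbf{P} \triangleq \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is called the hat matrix, since it “puts the hat on $\mathbf{y}$”. 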
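
As a worked check of Equations 7.20 and 7.21, the computation below uses the $N = 3$, $D = 2$ data from Equation 7.17. The arithmetic is ours, with values rounded to at most five decimal places; it is not taken from the figure code.

```latex
\begin{align*}
\mathbf{X}^T\mathbf{X} &= \begin{pmatrix} 3 & 2 \\ 2 & 12 \end{pmatrix},
\qquad
\mathbf{X}^T\mathbf{y} = \begin{pmatrix} 11.2848 \\ 20.1176 \end{pmatrix} \\
\hat{\mathbf{w}} &= \frac{1}{32}\begin{pmatrix} 12 & -2 \\ -2 & 3 \end{pmatrix}
\begin{pmatrix} 11.2848 \\ 20.1176 \end{pmatrix}
\approx \begin{pmatrix} 2.97445 \\ 1.18073 \end{pmatrix} \\
\hat{\mathbf{y}} &= \mathbf{X}\hat{\mathbf{w}} \approx (5.3359,\ 0.6130,\ 5.3359)^T,
\qquad
\mathbf{y} - \hat{\mathbf{y}} \approx (3.5598,\ 0,\ -3.5598)^T \\
\mathbf{X}^T(\mathbf{y} - \hat{\mathbf{y}}) &\approx
\begin{pmatrix} 3.5598 + 0 - 3.5598 \\ 2(3.5598) - 2(0) + 2(-3.5598) \end{pmatrix}
= \mathbf{0}
\end{align*}
```

The first and third rows of $\mathbf{X}$ are identical, so the fit predicts the average of their two responses, $(8.8957 + 1.7761)/2 = 5.3359$, for both, and the residual is orthogonal to both columns of $\mathbf{X}$, as Equation 7.20 requires.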


Convexity 


When discussing least squares, we noted that the NLL had a bowl shape with a unique minimum. 
The technical term for functions like this is convex. Convex functions play a very important 
role in machine learning. 

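Let us define this concept more precisely. We say a set $\mathcal{S}$ is convex if for any $\boldsymbol{\theta}, \boldsymbol{\theta}' \in \mathcal{S}$, we 
have 


$\lambda\boldsymbol{\theta} + (1 - \lambda)\boldsymbol{\theta}' \in \mathcal{S}, \quad \forall \lambda \in [0, 1]$ (7.22) 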


Figure 7.4 (a) Illustration of a convex set. (b) Illustration of a nonconvex set. 


Figure 7.5 (a) Illustration of a convex function. We see that the chord joining (x, f(x)) to (y, f(y)) lies 
above the function. (b) A function that is neither convex nor concave. A is a local minimum, B is a global 
minimum. Figure generated by convexFnHand. 


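That is, if we draw a line from $\boldsymbol{\theta}$ to $\boldsymbol{\theta}'$, all points on the line lie inside the set. See Figure 7.4(a) 
for an illustration of a convex set, and Figure 7.4(b) for an illustration of a non-convex set. 

A function $f(\boldsymbol{\theta})$ is called convex if its epigraph (the set of points above the function) defines 
a convex set. Equivalently, a function $f(\boldsymbol{\theta})$ is called convex if it is defined on a convex set $\mathcal{S}$ and 
if, for any $\boldsymbol{\theta}, \boldsymbol{\theta}' \in \mathcal{S}$, and for any $0 \le \lambda \le 1$, we have 


$f(\lambda\boldsymbol{\theta} + (1 - \lambda)\boldsymbol{\theta}') \le \lambda f(\boldsymbol{\theta}) + (1 - \lambda) f(\boldsymbol{\theta}')$ (7.23) 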


See Figure 7.5 for a 1d example. A function is called strictly convex if the inequality is strict. A 
function $f(\boldsymbol{\theta})$ is concave if $-f(\boldsymbol{\theta})$ is convex. Examples of scalar convex functions include $\theta^2$, 
$e^{\theta}$, and $\theta\log\theta$ (for $\theta > 0$). Examples of scalar concave functions include $\log(\theta)$ and $\sqrt{\theta}$. 

Intuitively, a (strictly) convex function has a “bowl shape”, and hence has at most one global 
minimum $\theta^*$, corresponding to the bottom of the bowl. For a twice-differentiable scalar function, a 
second derivative that is positive everywhere, $\frac{d^2}{d\theta^2}f(\theta) > 0$, guarantees strict convexity. A 
twice-continuously differentiable, multivariate function $f$ is convex iff its Hessian is positive 
semidefinite for all $\boldsymbol{\theta}$, and it is strictly convex if its Hessian is positive definite for all $\boldsymbol{\theta}$.¹ In the 
machine learning context, the function $f$ often corresponds to the NLL. 


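1. Recall that the Hessian is the matrix of second partial derivatives, defined by $H_{jk} = \frac{\partial^2 f(\boldsymbol{\theta})}{\partial\theta_j\,\partial\theta_k}$. Also, recall that a 
matrix $\mathbf{H}$ is positive definite iff $\mathbf{v}^T\mathbf{H}\mathbf{v} > 0$ for any non-zero vector $\mathbf{v}$. 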
[Figure 7.6 plots omitted. Panel (a) title: Linear data with noise and outliers; legend includes: least squares, laplace.] 

Figure 7.6 (a) Illustration of robust linear regression. Figure generated by linregRobustDemoCombined. 
(b) Illustration of $\ell_2$, $\ell_1$, and Huber loss functions. Figure generated by huberLossDemo. 


Models where the NLL is convex are desirable, since this means we can always find the 
globally optimal MLE. We will see many examples of this later in the book. However, many 
models of interest will not have concave likelihoods. In such cases, we will discuss ways to 
derive locally optimal parameter estimates. 
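
To connect this definition back to least squares, the short derivation below (a sketch that assumes the gradient in Equation 7.14) shows why the least squares NLL is convex, and when its minimum is unique.

```latex
\begin{align*}
\mathbf{g}(\mathbf{w}) &= \mathbf{X}^T\mathbf{X}\mathbf{w} - \mathbf{X}^T\mathbf{y}
\quad\Rightarrow\quad
\mathbf{H} = \frac{\partial \mathbf{g}(\mathbf{w})}{\partial \mathbf{w}^T} = \mathbf{X}^T\mathbf{X} \\
\mathbf{v}^T\mathbf{H}\mathbf{v} &= \mathbf{v}^T\mathbf{X}^T\mathbf{X}\mathbf{v}
= \|\mathbf{X}\mathbf{v}\|_2^2 \ge 0 \quad \text{for all } \mathbf{v}
\end{align*}
```

So the Hessian is positive semidefinite everywhere and the NLL is convex. If $\mathbf{X}$ has full column rank, then $\mathbf{X}\mathbf{v} \neq \mathbf{0}$ for every non-zero $\mathbf{v}$, the Hessian is positive definite, and the minimum is unique.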


Robust linear regression * 


It is very common to model the noise in regression models using a Gaussian distribution 
with zero mean and constant variance, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, where $\epsilon_i = y_i - \mathbf{w}^T\mathbf{x}_i$. In this case, 
maximizing likelihood is equivalent to minimizing the sum of squared residuals, as we have 
seen. However, if we have outliers in our data, this can result in a poor fit, as illustrated in 
Figure 7.6(a). (The outliers are the points on the bottom of the figure.) This is because squared 
error penalizes deviations quadratically, so points far from the line have more effect on the fit 
than points near to the line. 

One way to achieve robustness to outliers is to replace the Gaussian distribution for the 
response variable with a distribution that has heavy tails. Such a distribution will assign higher 
likelihood to outliers, without having to perturb the straight line to “explain” them. 

One possibility is to use the Laplace distribution, introduced in Section 2.4.3. If we use this 
as our observation model for regression, we get the following likelihood: 


$p(y|\mathbf{x}, \mathbf{w}, b) = \mathrm{Lap}(y|\mathbf{w}^T\mathbf{x}, b) \propto \exp\left(-\frac{1}{b}|y - \mathbf{w}^T\mathbf{x}|\right)$ (7.24) 


The robustness arises from the use of $|y - \mathbf{w}^T\mathbf{x}|$ instead of $(y - \mathbf{w}^T\mathbf{x})^2$. For simplicity, we will 
assume $b$ is fixed. Let $r_i \triangleq y_i - \mathbf{w}^T\mathbf{x}_i$ be the $i$'th residual. The NLL has the form 


$\ell(\mathbf{w}) = \sum_i |r_i(\mathbf{w})|$ (7.25) 
Likelihood   Prior      Name                Section 
Gaussian     Uniform    Least squares       7.3 
Gaussian     Gaussian   Ridge               7.5 
Gaussian     Laplace    Lasso               13.3 
Laplace      Uniform    Robust regression   7.4 
Student      Uniform    Robust regression   Exercise 11.12 


Table 7.1 Summary of various likelihoods and priors used for linear regression. The likelihood refers to 
the distributional form of $p(y|\mathbf{x}, \mathbf{w}, \sigma^2)$, and the prior refers to the distributional form of $p(\mathbf{w})$. MAP 
estimation with a uniform distribution corresponds to MLE. 


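Unfortunately, this is a non-linear objective function, which is hard to optimize. Fortunately, we 
can convert the NLL to a linear objective, subject to linear constraints, using the following split 
variable trick. First we define 


$r_i \triangleq r_i^+ - r_i^-$ (7.26) 


and then we impose the linear inequality constraints that $r_i^+ \ge 0$ and $r_i^- \ge 0$. Now the 
constrained objective becomes 

$\min_{\mathbf{w}, \mathbf{r}^+, \mathbf{r}^-} \sum_i (r_i^+ + r_i^-) \quad \text{s.t.} \quad r_i^+ \ge 0,\ r_i^- \ge 0,\ \mathbf{w}^T\mathbf{x}_i + r_i^+ - r_i^- = y_i$ (7.27) 

This is an example of a linear program with $D + 2N$ unknowns and $3N$ constraints. 
Since this is a convex optimization problem, any local optimum is also a global optimum (although the 
minimizer need not be unique). To solve an LP, we must first write it in standard form, which is as follows: 


$\min_{\boldsymbol{\theta}} \mathbf{f}^T\boldsymbol{\theta} \quad \text{s.t.} \quad \mathbf{A}\boldsymbol{\theta} \le \mathbf{b},\ \mathbf{A}_{eq}\boldsymbol{\theta} = \mathbf{b}_{eq},\ \mathbf{l} \le \boldsymbol{\theta} \le \mathbf{u}$ (7.28) 

In our current example, $\boldsymbol{\theta} = (\mathbf{w}, \mathbf{r}^+, \mathbf{r}^-)$, $\mathbf{f} = [\mathbf{0}, \mathbf{1}, \mathbf{1}]$, $\mathbf{A} = []$, $\mathbf{b} = []$, $\mathbf{A}_{eq} = [\mathbf{X}, \mathbf{I}, -\mathbf{I}]$, 
$\mathbf{b}_{eq} = \mathbf{y}$, $\mathbf{l} = [-\infty\mathbf{1}, \mathbf{0}, \mathbf{0}]$, $\mathbf{u} = []$. This can be solved by any LP solver (see e.g., (Boyd and 
Vandenberghe 2004)). See Figure 7.6(a) for an example of the method in action. 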
An alternative to using NLL under a Laplace likelihood is to minimize the Huber loss function 
(Huber 1964), defined as follows: 


$L_H(r, \delta) = \begin{cases} r^2/2 & \text{if } |r| \le \delta \\ \delta|r| - \delta^2/2 & \text{if } |r| > \delta \end{cases}$ (7.29) 


This is equivalent to $\ell_2$ for errors that are smaller than $\delta$, and is equivalent to $\ell_1$ for larger errors. 
See Figure 7.6(b). The advantage of this loss function is that it is everywhere differentiable, 
using the fact that $\frac{d}{dr}|r| = \mathrm{sign}(r)$ if $r \neq 0$. We can also check that the function is $C_1$ 
continuous, since the gradients of the two parts of the function match at $r = \pm\delta$, namely 
$\frac{d}{dr}L_H(r, \delta)|_{r=\delta} = \delta$. Consequently optimizing the Huber loss is much faster than using the 
Laplace likelihood, since we can use standard smooth optimization methods (such as quasi-Newton) 
instead of linear programming. 
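
The two branches of the Huber loss and its derivative can be written down directly. The sketch below is an illustration of Equation 7.29 only (the function names and the threshold value are hypothetical); it shows that both the loss and its derivative are continuous at |r| = δ.

```python
def huber_loss(r, delta):
    """Huber loss of Equation 7.29 for a scalar residual r."""
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * abs(r) - 0.5 * delta * delta


def huber_grad(r, delta):
    """Derivative w.r.t. r: equal to r in the quadratic zone, +/-delta outside."""
    if abs(r) <= delta:
        return r
    return delta if r > 0 else -delta


delta = 1.0  # hypothetical threshold
for r in (-3.0, -1.0, 0.5, 1.0, 3.0):
    print(r, huber_loss(r, delta), huber_grad(r, delta))
```

For large residuals the derivative saturates at ±δ, so a single outlier cannot dominate the gradient the way it does under squared error.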

Figure 7.6(a) gives an illustration of the Huber loss function. The results are qualitatively 
similar to the probabilistic methods. (In fact, it turns out that the Huber method also has a 
probabilistic interpretation, although it is rather unnatural (Pontil et al. 1998).) 
[Figure 7.7 plots omitted. Panel titles: (a) ln lambda -20.135; (b) ln lambda -8.571.] 

Figure 7.7 Degree 14 Polynomial fit to $N = 21$ data points with increasing amounts of $\ell_2$ regularization. 
Data was generated from noise with variance $\sigma^2 = 4$. The error bars, representing the noise variance $\sigma^2$, 
get wider as the fit gets smoother, since we are ascribing more of the data variation to the noise. Figure 
generated by linregPolyVsRegDemo. 


Ridge regression 


One problem with ML estimation is that it can result in overfitting. In this section, we discuss a 
way to ameliorate this problem by using MAP estimation with a Gaussian prior. For simplicity, 
we assume a Gaussian likelihood, rather than a robust likelihood. 


Basic idea 


The reason that the MLE can overfit is that it is picking the parameter values that are the 
best for modeling the training data; but if the data is noisy, such parameters often result in 
complex functions. As a simple example, suppose we fit a degree 14 polynomial to N = 21 data 
points using least squares. The resulting curve is very “wiggly”, as shown in Figure 7.7(a). The 
corresponding least squares coefficients (excluding $w_0$) are as follows: 


6.560, -36.934, -109.255, 543.452, 1022.561, -3046.224, -3768.013, 
8524.540, 6607.897, -12640.058, -5530.188, 9479.730, 1774.639, -2821.526 


We see that there are many large positive and negative numbers. These balance out exactly 
to make the curve “wiggle” in just the right way so that it almost perfectly interpolates the data. 
But this situation is unstable: if we changed the data a little, the coefficients would change a lot. 

We can encourage the parameters to be small, thus resulting in a smoother curve, by using a 
zero-mean Gaussian prior: 


$p(\mathbf{w}) = \prod_j \mathcal{N}(w_j|0, \tau^2)$ (7.30) 


where $1/\tau^2$ controls the strength of the prior. The corresponding MAP estimation problem 
becomes 


$\arg\max_{\mathbf{w}} \sum_{i=1}^N \log\mathcal{N}(y_i|w_0 + \mathbf{w}^T\mathbf{x}_i, \sigma^2) + \sum_{j=1}^D \log\mathcal{N}(w_j|0, \tau^2)$ (7.31) 
[Figure 7.8 plots omitted. Panel (a): mean squared error vs log lambda; legend: train mse, test mse. Panel (b): horizontal axis log lambda; legend: negative log marg. likelihood, CV estimate of MSE.] 

Figure 7.8 (a) Training error (dotted blue) and test error (solid red) for a degree 14 polynomial fit by 
ridge regression, plotted vs $\log(\lambda)$. Data was generated from noise with variance $\sigma^2 = 4$ (training set 
has size $N = 21$). Note: Models are ordered from complex (small regularizer) on the left to simple (large 
regularizer) on the right. The stars correspond to the values used to plot the functions in Figure 7.7. (b) 
Estimate of performance using training set. Dotted blue: 5-fold cross-validation estimate of future MSE. 
Solid black: negative log marginal likelihood, $-\log p(\mathcal{D}|\lambda)$. Both curves have been vertically rescaled to 
[0,1] to make them comparable. Figure generated by linregPolyVsRegDemo. 


It is a simple exercise to show that this is equivalent to minimizing the following: 

$J(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^N (y_i - (w_0 + \mathbf{w}^T\mathbf{x}_i))^2 + \lambda\|\mathbf{w}\|_2^2$ (7.32) 


where $\lambda \triangleq \sigma^2/\tau^2$ and $\|\mathbf{w}\|_2^2 = \sum_j w_j^2 = \mathbf{w}^T\mathbf{w}$ is the squared two-norm. Here the first term is 
the MSE/NLL as usual, and the second term, with $\lambda \ge 0$, is a complexity penalty. The corresponding 
solution is given by 


$\hat{\mathbf{w}}_{ridge} = (\lambda\mathbf{I}_D + \mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ (7.33) 


This technique is known as ridge regression, or penalized least squares. In general, adding 
a Gaussian prior to the parameters of a model to encourage them to be small is called $\ell_2$ 
regularization or weight decay. Note that the offset term $w_0$ is not regularized, since this just 
affects the height of the function, not its complexity. By penalizing the sum of the squared magnitudes 
of the weights, we ensure the function is simple (since $\mathbf{w} = \mathbf{0}$ corresponds to a horizontal straight line, 
which is the simplest possible function, corresponding to a constant). 
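
To see the shrinkage effect in the simplest possible setting, the sketch below specializes the ridge solution to a single centered feature with no offset, where the matrix inverse becomes a scalar division: w = (Σ x_i y_i) / (λ + Σ x_i²). The data and the function name `ridge_1d` are hypothetical.

```python
def ridge_1d(x, y, lam):
    """Scalar ridge estimate: w = (lam + sum x_i^2)^(-1) * sum x_i y_i."""
    sxx = sum(x_i * x_i for x_i in x)
    sxy = sum(x_i * y_i for x_i, y_i in zip(x, y))
    return sxy / (lam + sxx)


# Hypothetical centered data, so no offset term is needed.
x = [-1.0, 0.0, 1.0]
y = [-2.1, 0.1, 2.0]
for lam in (0.0, 1.0, 10.0):
    # The estimate moves from the least squares value towards 0 as lam grows.
    print(lam, ridge_1d(x, y, lam))
```

With λ = 0 this is the ordinary least squares slope; increasing λ pulls the estimate monotonically towards zero.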

We illustrate this idea in Figure 7.7, where we see that increasing $\lambda$ results in smoother 
functions. The resulting coefficients also become smaller. For example, using $\lambda = 10^{-3}$, we 
have 
2.128, 0.807, 16.457, 3.704, -24.948, -10.472, -2.625, 4.360, 13.711, 
10.063, 8.716, 3.966, -9.349, -9.232 


In Figure 7.8(a), we plot the MSE on the training and test sets vs $\log(\lambda)$. We see that, as we 
increase $\lambda$ (so the model becomes more constrained), the error on the training set increases. 
For the test set, we see the characteristic U-shaped curve, where the model overfits and then 
underfits. It is common to use cross validation to pick $\lambda$, as shown in Figure 7.8(b). In 
Section 1.4.8, we will discuss a more probabilistic approach. 

We will consider a variety of different priors in this book. Each of these corresponds to a 
different form of regularization. This technique is very widely used to prevent overfitting. 


Numerically stable computation * 


Interestingly, ridge regression, which works better statistically, is also easier to fit numerically, 
since $(\lambda\mathbf{I}_D + \mathbf{X}^T\mathbf{X})$ is much better conditioned (and hence more likely to be invertible) than 
$\mathbf{X}^T\mathbf{X}$, at least for suitably large $\lambda$. 

Nevertheless, inverting matrices is still best avoided, for reasons of numerical stability. (Indeed, 
if you write w=inv(X'*X)*X'*y in Matlab, it will give you a warning.) We now describe 
a useful trick for fitting ridge regression models (and hence by extension, computing vanilla 
OLS estimates) that is more numerically robust. We assume the prior has the form $p(\mathbf{w}) = 
\mathcal{N}(\mathbf{0}, \boldsymbol{\Lambda}^{-1})$, where $\boldsymbol{\Lambda}$ is the precision matrix. In the case of ridge regression, $\boldsymbol{\Lambda} = (1/\tau^2)\mathbf{I}$. To 
avoid penalizing the $w_0$ term, we should center the data first, as explained in Exercise 7.5. 

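First let us augment the original data with some “virtual data” coming from the prior: 


$\tilde{\mathbf{X}} = \begin{pmatrix} \mathbf{X}/\sigma \\ \sqrt{\boldsymbol{\Lambda}} \end{pmatrix}, \quad \tilde{\mathbf{y}} = \begin{pmatrix} \mathbf{y}/\sigma \\ \mathbf{0}_{D \times 1} \end{pmatrix}$ (7.34) 


where $\boldsymbol{\Lambda} = \sqrt{\boldsymbol{\Lambda}}^T\sqrt{\boldsymbol{\Lambda}}$ is a Cholesky decomposition of $\boldsymbol{\Lambda}$. We see that $\tilde{\mathbf{X}}$ is $(N + D) \times D$, 
where the extra rows represent pseudo-data from the prior. 

We now show that the NLL on this expanded data is equivalent to penalized NLL on the 
original data: 


$f(\mathbf{w}) = (\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w})^T(\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w})$ (7.35) 
$= \left(\begin{pmatrix} \mathbf{y}/\sigma \\ \mathbf{0} \end{pmatrix} - \begin{pmatrix} \mathbf{X}/\sigma \\ \sqrt{\boldsymbol{\Lambda}} \end{pmatrix}\mathbf{w}\right)^T \left(\begin{pmatrix} \mathbf{y}/\sigma \\ \mathbf{0} \end{pmatrix} - \begin{pmatrix} \mathbf{X}/\sigma \\ \sqrt{\boldsymbol{\Lambda}} \end{pmatrix}\mathbf{w}\right)$ (7.36) 
$= \begin{pmatrix} \frac{1}{\sigma}(\mathbf{y} - \mathbf{X}\mathbf{w}) \\ -\sqrt{\boldsymbol{\Lambda}}\mathbf{w} \end{pmatrix}^T \begin{pmatrix} \frac{1}{\sigma}(\mathbf{y} - \mathbf{X}\mathbf{w}) \\ -\sqrt{\boldsymbol{\Lambda}}\mathbf{w} \end{pmatrix}$ (7.37) 
$= \frac{1}{\sigma^2}(\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w}) + (\sqrt{\boldsymbol{\Lambda}}\mathbf{w})^T(\sqrt{\boldsymbol{\Lambda}}\mathbf{w})$ (7.38) 
$= \frac{1}{\sigma^2}(\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w}) + \mathbf{w}^T\boldsymbol{\Lambda}\mathbf{w}$ (7.39) 


Hence the MAP estimate is given by 

$\hat{\mathbf{w}}_{ridge} = (\tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T\tilde{\mathbf{y}}$ (7.40) 


as we claimed. 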
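
As a sanity check on the augmentation trick, consider the scalar case $D = 1$ with a single feature vector $\mathbf{x} \in \mathbb{R}^N$ and precision $\Lambda = 1/\tau^2$. This is our own specialization of Equations 7.34 and 7.40, not a step taken from the text.

```latex
\begin{align*}
\tilde{\mathbf{X}} &= \begin{pmatrix} \mathbf{x}/\sigma \\ 1/\tau \end{pmatrix},
\qquad
\tilde{\mathbf{y}} = \begin{pmatrix} \mathbf{y}/\sigma \\ 0 \end{pmatrix} \\
\tilde{\mathbf{X}}^T\tilde{\mathbf{X}} &= \frac{\mathbf{x}^T\mathbf{x}}{\sigma^2} + \frac{1}{\tau^2},
\qquad
\tilde{\mathbf{X}}^T\tilde{\mathbf{y}} = \frac{\mathbf{x}^T\mathbf{y}}{\sigma^2} \\
\hat{w} &= \frac{\mathbf{x}^T\mathbf{y}/\sigma^2}{\mathbf{x}^T\mathbf{x}/\sigma^2 + 1/\tau^2}
= \frac{\mathbf{x}^T\mathbf{y}}{\mathbf{x}^T\mathbf{x} + \sigma^2/\tau^2}
= \frac{\mathbf{x}^T\mathbf{y}}{\mathbf{x}^T\mathbf{x} + \lambda}
\end{align*}
```

This is the $D = 1$ case of the ridge solution $(\lambda\mathbf{I}_D + \mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ with $\lambda = \sigma^2/\tau^2$, so ordinary least squares on the augmented data reproduces the ridge estimate.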



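Now let 

$\tilde{\mathbf{X}} = \mathbf{Q}\mathbf{R}$ (7.41) 


be the QR decomposition of $\tilde{\mathbf{X}}$, where $\mathbf{Q}$ has orthonormal columns (meaning $\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$), and 
$\mathbf{R}$ is upper triangular. Then 


$(\tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1} = (\mathbf{R}^T\mathbf{Q}^T\mathbf{Q}\mathbf{R})^{-1} = (\mathbf{R}^T\mathbf{R})^{-1} = \mathbf{R}^{-1}\mathbf{R}^{-T}$ (7.42) 

Hence 

$\hat{\mathbf{w}}_{ridge} = \mathbf{R}^{-1}\mathbf{R}^{-T}\mathbf{R}^T\mathbf{Q}^T\tilde{\mathbf{y}} = \mathbf{R}^{-1}\mathbf{Q}^T\tilde{\mathbf{y}}$ (7.43) 


Note that $\mathbf{R}$ is easy to invert since it is upper triangular. This gives us a way to compute the 
ridge estimate while avoiding having to invert $(\boldsymbol{\Lambda} + \mathbf{X}^T\mathbf{X})$. 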

We can use this technique to find the MLE, by simply computing the QR decomposition of 
the unaugmented matrix $\mathbf{X}$, and using the original $\mathbf{y}$. This is the method of choice for solving 
least squares problems. (In fact, it is so common that it can be implemented in one line of 
Matlab, using the backslash operator: w=X\y.) Note that computing the QR decomposition of 
an $N \times D$ matrix takes $O(ND^2)$ time, and is numerically very stable. 
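
The QR route to least squares can be sketched with the standard library alone. The code below is an illustration under stated assumptions: it uses modified Gram-Schmidt to build an economy QR factorization (library routines typically use Householder reflections instead, which are more robust), assumes the columns of X are linearly independent, and then solves R w = Q^T y by back substitution. The function names are hypothetical; the data is the N = 3, D = 2 example used in the geometric interpretation above.

```python
def qr_gram_schmidt(A):
    """Economy QR of an N x D matrix (list of rows) by modified Gram-Schmidt.

    Returns Q as a list of D orthonormal columns and R as a D x D upper
    triangular matrix. Assumes the columns of A are linearly independent.
    """
    n, d = len(A), len(A[0])
    cols = [[A[i][j] for i in range(n)] for j in range(d)]
    Q, R = [], [[0.0] * d for _ in range(d)]
    for j in range(d):
        v = cols[j][:]
        for k in range(j):
            R[k][j] = sum(q * v_i for q, v_i in zip(Q[k], v))
            v = [v_i - R[k][j] * q for v_i, q in zip(v, Q[k])]
        R[j][j] = sum(v_i * v_i for v_i in v) ** 0.5
        Q.append([v_i / R[j][j] for v_i in v])
    return Q, R


def lstsq_qr(X, y):
    """Solve R w = Q^T y by back substitution, avoiding any explicit inverse."""
    Q, R = qr_gram_schmidt(X)
    d = len(R)
    qty = [sum(q_i * y_i for q_i, y_i in zip(Q[j], y)) for j in range(d)]
    w = [0.0] * d
    for j in reversed(range(d)):
        s = sum(R[j][k] * w[k] for k in range(j + 1, d))
        w[j] = (qty[j] - s) / R[j][j]
    return w


X = [[1.0, 2.0], [1.0, -2.0], [1.0, 2.0]]
y = [8.8957, 0.6130, 1.7761]
print(lstsq_qr(X, y))  # approximately [2.97445, 1.180725]
```

To obtain the ridge estimate instead, the same routine could be applied to the augmented matrix and response of the virtual-data construction.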

If $D \gg N$, we should first perform an SVD decomposition. In particular, let $\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$ be 
the SVD of $\mathbf{X}$, where $\mathbf{V}^T\mathbf{V} = \mathbf{I}_N$, $\mathbf{U}\mathbf{U}^T = \mathbf{U}^T\mathbf{U} = \mathbf{I}_N$, and $\mathbf{S}$ is a diagonal $N \times N$ matrix. 
Now let $\mathbf{Z} = \mathbf{U}\mathbf{S}$ be an $N \times N$ matrix. Then we can rewrite the ridge estimate thus: 


$\hat{\mathbf{w}}_{ridge} = \mathbf{V}(\mathbf{Z}^T\mathbf{Z} + \lambda\mathbf{I}_N)^{-1}\mathbf{Z}^T\mathbf{y}$ (7.44) 


In other words, we can replace the $D$-dimensional vectors $\mathbf{x}_i$ with the $N$-dimensional vectors 
$\mathbf{z}_i$ and perform our penalized fit as before. We then transform the $N$-dimensional solution 
to the $D$-dimensional solution by multiplying by $\mathbf{V}$. Geometrically, we are rotating to a new 
coordinate system in which all but the first $N$ coordinates are zero. This does not affect the 
solution since the spherical Gaussian prior is rotationally invariant. The overall time is now 
$O(DN^2)$ operations. 
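
The rewritten ridge estimate can be justified in a few lines. The derivation below is a sketch of one possible route, assuming the factorization $\mathbf{X} = \mathbf{Z}\mathbf{V}^T$ with $\mathbf{V}^T\mathbf{V} = \mathbf{I}_N$ and using the identity $(\mathbf{A}^T\mathbf{A} + \lambda\mathbf{I})^{-1}\mathbf{A}^T = \mathbf{A}^T(\mathbf{A}\mathbf{A}^T + \lambda\mathbf{I})^{-1}$, which follows from $\mathbf{A}^T(\mathbf{A}\mathbf{A}^T + \lambda\mathbf{I}) = (\mathbf{A}^T\mathbf{A} + \lambda\mathbf{I})\mathbf{A}^T$.

```latex
\begin{align*}
\hat{\mathbf{w}}_{ridge}
&= (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_D)^{-1}\mathbf{X}^T\mathbf{y}
 = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I}_N)^{-1}\mathbf{y} \\
&= \mathbf{V}\mathbf{Z}^T(\mathbf{Z}\mathbf{V}^T\mathbf{V}\mathbf{Z}^T + \lambda\mathbf{I}_N)^{-1}\mathbf{y}
 = \mathbf{V}\mathbf{Z}^T(\mathbf{Z}\mathbf{Z}^T + \lambda\mathbf{I}_N)^{-1}\mathbf{y} \\
&= \mathbf{V}(\mathbf{Z}^T\mathbf{Z} + \lambda\mathbf{I}_N)^{-1}\mathbf{Z}^T\mathbf{y}
\end{align*}
```

The last step applies the same identity to $\mathbf{Z}$. Only $N \times N$ systems are solved, which is where the saving comes from when $D$ is much larger than $N$.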


Connection with PCA * 


In this section, we discuss an interesting connection between ridge regression and PCA (Sec- 
tion 12.2), which gives further insight into why ridge regression works well. Our discussion is 
based on (Hastie et al. 2009, p66). 

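Let $\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$ be the SVD of $\mathbf{X}$. From Equation 7.44, we have 


$\hat{\mathbf{w}}_{ridge} = \mathbf{V}(\mathbf{S}^2 + \lambda\mathbf{I})^{-1}\mathbf{S}\mathbf{U}^T\mathbf{y}$ (7.45) 

Hence the ridge predictions on the training set are given by 


$\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}}_{ridge} = \mathbf{U}\mathbf{S}\mathbf{V}^T\mathbf{V}(\mathbf{S}^2 + \lambda\mathbf{I})^{-1}\mathbf{S}\mathbf{U}^T\mathbf{y}$ (7.46) 
$= \mathbf{U}\tilde{\mathbf{S}}\mathbf{U}^T\mathbf{y} = \sum_{j=1}^D \mathbf{u}_j\tilde{S}_{jj}\mathbf{u}_j^T\mathbf{y}$ (7.47) 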
Figure 7.9 Geometry of ridge regression. The likelihood is shown as an ellipse, and the prior is shown 
as a circle centered on the origin. Based on Figure 3.15 of (Bishop 2006b). Figure generated by geomRidge. 


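where 

$\tilde{S}_{jj} \triangleq [\mathbf{S}(\mathbf{S}^2 + \lambda\mathbf{I})^{-1}\mathbf{S}]_{jj} = \frac{\sigma_j^2}{\sigma_j^2 + \lambda}$ (7.48) 


and $\sigma_j$ are the singular values of $\mathbf{X}$. Hence 


$\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}}_{ridge} = \sum_{j=1}^D \mathbf{u}_j\frac{\sigma_j^2}{\sigma_j^2 + \lambda}\mathbf{u}_j^T\mathbf{y}$ (7.49) 


In contrast, the least squares prediction is 

$\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{w}}_{ls} = (\mathbf{U}\mathbf{S}\mathbf{V}^T)(\mathbf{V}\mathbf{S}^{-1}\mathbf{U}^T\mathbf{y}) = \mathbf{U}\mathbf{U}^T\mathbf{y} = \sum_{j=1}^D \mathbf{u}_j\mathbf{u}_j^T\mathbf{y}$ (7.50) 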

If $\sigma_j^2$ is small compared to $\lambda$, then direction $\mathbf{u}_j$ will not have much effect on the prediction. In 
view of this, we define the effective number of degrees of freedom of the model as follows: 


$\mathrm{dof}(\lambda) = \sum_{j=1}^D \frac{\sigma_j^2}{\sigma_j^2 + \lambda}$ (7.51) 


When $\lambda = 0$, $\mathrm{dof}(\lambda) = D$, and as $\lambda \to \infty$, $\mathrm{dof}(\lambda) \to 0$. 

Let us try to understand why this behavior is desirable. In Section 7.6, we show that 
$\mathrm{cov}[\mathbf{w}|\mathcal{D}] = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$, if we use a uniform prior for $\mathbf{w}$. Thus the directions in which 
we are most uncertain about $\mathbf{w}$ are determined by the eigenvectors of $\mathbf{X}^T\mathbf{X}$ with the 
smallest eigenvalues, as shown in Figure 4.1. Furthermore, in Section 12.2.3, we show that the 
squared singular values $\sigma_j^2$ are equal to the eigenvalues of $\mathbf{X}^T\mathbf{X}$. Hence small singular values $\sigma_j$ 
correspond to directions with high posterior variance. It is these directions which ridge shrinks 
the most. 
This process is illustrated in Figure 7.9. The horizontal $w_1$ parameter is not well determined 
by the data (has high posterior variance), but the vertical $w_2$ parameter is well determined. 
Hence $w_2^{map}$ is close to $\hat{w}_2^{mle}$, but $w_1^{map}$ is shifted strongly towards the prior mean, which is 0. 
(Compare to Figure 4.14(c), which illustrated sensor fusion with sensors of different reliabilities.) 
In this way, ill-determined parameters are reduced in size towards 0. This is called shrinkage. 
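
The per-direction shrinkage factors and the effective degrees of freedom are simple functions of the singular values. The following sketch evaluates them for a hypothetical pair of singular values, one large (well-determined direction) and one small (ill-determined direction); the function names are ours.

```python
def shrinkage_factors(sing_vals, lam):
    """Per-direction factors sigma_j^2 / (sigma_j^2 + lam)."""
    return [s * s / (s * s + lam) for s in sing_vals]


def dof(sing_vals, lam):
    """Effective degrees of freedom: the sum of the shrinkage factors."""
    return sum(shrinkage_factors(sing_vals, lam))


# Hypothetical singular values: one well-determined, one poorly determined direction.
sing_vals = [10.0, 0.1]
for lam in (0.0, 1.0, 100.0):
    print(lam, shrinkage_factors(sing_vals, lam), dof(sing_vals, lam))
```

With λ = 1 the direction with singular value 10 is left almost untouched (factor about 0.99), while the direction with singular value 0.1 is shrunk almost to zero (factor about 0.01).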

There is a related, but different, technique called principal components regression. The idea 
is this: first use PCA to reduce the dimensionality to K dimensions, and then use these low 
dimensional features as input to regression. However, this technique does not work as well as 
ridge in terms of predictive accuracy (Hastie et al. 2001, p70). The reason is that in PC regression, 
only the first K (derived) dimensions are retained, and the remaining D — K dimensions are 
entirely ignored. By contrast, ridge regression uses a “soft” weighting of all the dimensions. 


Regularization effects of big data 


Regularization is the most common way to avoid overfitting. However, another effective approach 
(which is not always available) is to use lots of data. It should be intuitively obvious that 
the more training data we have, the better we will be able to learn.² So we expect the test set 
error to decrease to some plateau as $N$ increases. 

This is illustrated in Figure 7.10, where we plot the mean squared error incurred on the test set 
achieved by polynomial regression models of different degrees vs N (a plot of error vs training 
set size is known as a learning curve). The level of the plateau for the test error consists of 
two terms: an irreducible component that all models incur, due to the intrinsic variability of 
the generating process (this is called the noise floor); and a component that depends on the 
discrepancy between the generating process (the “truth”) and the model: this is called structural 
error. 

In Figure 7.10, the truth is a degree 2 polynomial, and we try fitting polynomials of degrees 1, 
2 and 25 to this data. Call the 3 models $\mathcal{M}_1$, $\mathcal{M}_2$ and $\mathcal{M}_{25}$. We see that the structural error 
for models $\mathcal{M}_2$ and $\mathcal{M}_{25}$ is zero, since both are able to capture the true generating process. 
However, the structural error for $\mathcal{M}_1$ is substantial, which is evident from the fact that the 
plateau occurs high above the noise floor. 

For any model that is expressive enough to capture the truth (i.e., one with small structural 
error), the test error will go to the noise floor as $N \to \infty$. However, it will typically approach the 
noise floor faster for simpler models, since there are fewer parameters to estimate. In particular, for 
finite training sets, there will be some discrepancy between the parameters that we estimate 
and the best parameters that we could estimate given the particular model class. This is called 
approximation error, and goes to zero as $N \to \infty$, but it goes to zero faster for simpler 
models. This is illustrated in Figure 7.10. See also Exercise 7.1. 

In domains with lots of data, simple methods can work surprisingly well (Halevy et al. 2009). 
However, there are still reasons to study more sophisticated learning methods, because there 
will always be problems for which we have little data. For example, even in such a data-rich 
domain as web search, as soon as we want to start personalizing the results, the amount of data 
available for any given user starts to look small again (relative to the complexity of the problem). 


2. This assumes the training data is randomly sampled, and we don't just get repetitions of the same examples. Having 
informatively sampled data can help even more; this is the motivation for an approach known as active learning, where 
you get to choose your training data. 
Figure 7.10 MSE on training and test sets vs size of training set, for data generated from a degree 2 
polynomial with Gaussian noise of variance $\sigma^2 = 4$. We fit polynomial models of varying degree to this 
data. (a) Degree 1. (b) Degree 2. (c) Degree 10. (d) Degree 25. Note that for small training set sizes, the test 
error of the degree 25 polynomial is higher than that of the degree 2 polynomial, due to overfitting, but 
this difference vanishes once we have enough data. Note also that the degree 1 polynomial is too simple 
and has high test error even given large amounts of training data. Figure generated by linregPolyVsN. 


In such cases, we may want to learn multiple related models at the same time, which is known 
as multi-task learning. This will allow us to “borrow statistical strength” from tasks with lots of 
data and to share it with tasks with little data. We will discuss ways to do this later in the book. 


Bayesian linear regression 


Although ridge regression is a useful way to compute a point estimate, sometimes we want to 
compute the full posterior over $\mathbf{w}$ and $\sigma^2$. For simplicity, we will initially assume the noise 
variance $\sigma^2$ is known, so we focus on computing $p(\mathbf{w}|\mathcal{D}, \sigma^2)$. Then in Section 7.6.3 we consider 
the general case, where we compute $p(\mathbf{w}, \sigma^2|\mathcal{D})$. We assume throughout a Gaussian likelihood 
model. Performing Bayesian inference with a robust likelihood is also possible, but requires more 
advanced techniques (see Exercise 24.5). 

Computing the posterior 

In linear regression, the likelihood is given by 


$$p(y|X, w, \mu, \sigma^2) = \mathcal{N}(y|\mu + Xw, \sigma^2 I_N) \tag{7.52}$$

$$\propto \exp\left(-\frac{1}{2\sigma^2}(y - \mu 1_N - Xw)^T (y - \mu 1_N - Xw)\right) \tag{7.53}$$


where $\mu$ is an offset term. If the inputs are centered, so $\sum_i x_{ij} = 0$ for each $j$, the mean of the 
output is equally likely to be positive or negative. So let us put an improper prior on $\mu$ of the 
form $p(\mu) \propto 1$, and then integrate it out to get 

$$p(y|X, w, \sigma^2) \propto \exp\left(-\frac{1}{2\sigma^2}\|y - \bar{y} 1_N - Xw\|_2^2\right) \tag{7.54}$$

where $\bar{y} = \frac{1}{N}\sum_{i=1}^N y_i$ is the empirical mean of the output. For notational simplicity, we shall 
assume the output has been centered, and write $y$ for $y - \bar{y} 1_N$. 

The conjugate prior to the above Gaussian likelihood is also a Gaussian, which we will denote 
by $p(w) = \mathcal{N}(w|w_0, V_0)$. Using Bayes rule for Gaussians, Equation 4.125, the posterior is given 
by 

$$p(w|X, y, \sigma^2) \propto \mathcal{N}(w|w_0, V_0)\,\mathcal{N}(y|Xw, \sigma^2 I_N) = \mathcal{N}(w|w_N, V_N) \tag{7.55}$$
$$w_N = V_N V_0^{-1} w_0 + \frac{1}{\sigma^2} V_N X^T y \tag{7.56}$$
$$V_N^{-1} = V_0^{-1} + \frac{1}{\sigma^2} X^T X \tag{7.57}$$
$$V_N = \sigma^2 (\sigma^2 V_0^{-1} + X^T X)^{-1} \tag{7.58}$$

If $w_0 = 0$ and $V_0 = \tau^2 I$, then the posterior mean reduces to the ridge estimate, if we define 
$\lambda = \sigma^2/\tau^2$. This is because the mean and mode of a Gaussian are the same. 
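
The ridge correspondence can be checked numerically. The following is a minimal sketch, assuming synthetic data and hypothetical values for $\sigma^2$ and $\tau^2$ (none of which come from the text); the function name `posterior_known_noise` is also just illustrative. It evaluates Equations 7.56 and 7.57 and compares the posterior mean with the ridge estimate for $\lambda = \sigma^2/\tau^2$.

```python
import numpy as np

def posterior_known_noise(X, y, w0, V0, sigma2):
    """Gaussian posterior N(w | wN, VN) for linear regression with known noise variance."""
    V0_inv = np.linalg.inv(V0)
    VN_inv = V0_inv + X.T @ X / sigma2            # Equation 7.57
    VN = np.linalg.inv(VN_inv)
    wN = VN @ (V0_inv @ w0 + X.T @ y / sigma2)    # Equation 7.56
    return wN, VN

# Hypothetical toy data (not from the text).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=30)

sigma2, tau2 = 0.25, 2.0
wN, VN = posterior_known_noise(X, y, np.zeros(3), tau2 * np.eye(3), sigma2)

# Ridge estimate with lambda = sigma^2 / tau^2.
lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```

Under these assumptions `wN` and `w_ridge` agree up to round-off, while `VN` carries the extra information (the posterior spread) that the point estimate discards.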

To gain insight into the posterior distribution (and not just its mode), let us consider a 1D 
example: 


$$y(x, w) = w_0 + w_1 x + \epsilon \tag{7.59}$$


where the “true” parameters are $w_0 = -0.3$ and $w_1 = 0.5$. In Figure 7.11 we plot the prior, 
the likelihood, the posterior, and some samples from the posterior predictive. In particular, 
the right hand column plots the function $y(x, w^{(s)})$ where $x$ ranges over $[-1, 1]$, and $w^{(s)} \sim 
\mathcal{N}(w|w_N, V_N)$ is a sample from the parameter posterior. Initially, when we sample from the 
prior (first row), our predictions are “all over the place”, since our prior is uniform. After we see 
one data point (second row), our posterior becomes constrained by the corresponding likelihood, 
and our predictions pass close to the observed data. However, we see that the posterior has 
a ridge-like shape, reflecting the fact that there are many possible solutions, with different 
Figure 7.11 Sequential Bayesian updating of a linear regression model $p(y|x) = \mathcal{N}(y|w_0 x_0 + w_1 x_1, \sigma^2)$. 
Row 0 represents the prior, row 1 represents the first data point $(x_1, y_1)$, row 2 represents the second 
data point $(x_2, y_2)$, row 3 represents the 20th data point $(x_{20}, y_{20})$. Left column: likelihood function for 
current data point. Middle column: posterior given data so far, $p(w|x_{1:n}, y_{1:n})$ (so the first line is the 
prior). Right column: samples from the current prior/posterior predictive distribution. The white cross in 
columns 1 and 2 represents the true parameter value; we see that the mode of the posterior rapidly (after 
20 samples) converges to this point. The blue circles in column 3 are the observed data points. Based on 
Figure 3.7 of (Bishop 2006a). Figure generated by bayesLinRegDemo2d. 


slopes/intercepts. This makes sense since we cannot uniquely infer two parameters from one 
observation. After we see two data points (third row), the posterior becomes much narrower, 
and our predictions all have similar slopes and intercepts. After we observe 20 data points (last 
row), the posterior is essentially a delta function centered on the true value, indicated by a white 
cross. (The estimate converges to the truth since the data was generated from this model, and 
because Bayes is a consistent estimator; see Section 6.4.1 for discussion of this point.) 


Computing the posterior predictive 


It’s tough to make predictions, especially about the future. — Yogi Berra 

In machine learning, we often care more about predictions than about interpreting the parameters. 
Using Equation 4.126, we can easily show that the posterior predictive distribution at a test 
point $x$ is also Gaussian: 


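$$p(y|x, D, \sigma^2) = \int \mathcal{N}(y|x^T w, \sigma^2)\,\mathcal{N}(w|w_N, V_N)\,dw \tag{7.60}$$
$$= \mathcal{N}(y|w_N^T x, \sigma_N^2(x)) \tag{7.61}$$
$$\sigma_N^2(x) = \sigma^2 + x^T V_N x \tag{7.62}$$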


The variance in this prediction, $\sigma_N^2(x)$, depends on two terms: the variance of the observation 
noise, $\sigma^2$, and the variance in the parameters, $V_N$. The latter translates into variance about 
observations in a way which depends on how close $x$ is to the training data $D$. This is illustrated 
in Figure 7.12, where we see that the error bars get larger as we move away from the training 
points, representing increased uncertainty. This is important for applications such as active 
learning, where we want to model what we don’t know as well as what we do. By contrast, the 
plugin approximation has constant sized error bars, since 

$$p(y|x, D, \sigma^2) \approx \int \mathcal{N}(y|x^T w, \sigma^2)\,\delta_{\hat{w}}(w)\,dw = p(y|x, \hat{w}, \sigma^2) \tag{7.63}$$

See Figure 7.12(a). 
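
The contrast between the two kinds of error bars can be seen in a small numerical sketch. The training inputs, the bias feature, and the values of $\sigma^2$ and the prior variance below are all hypothetical choices made for illustration; only Equations 7.57 and 7.62 are taken from the text.

```python
import numpy as np

def predictive_variance(x, VN, sigma2):
    """sigma_N^2(x) = sigma^2 + x^T V_N x (Equation 7.62)."""
    return sigma2 + x @ VN @ x

# Hypothetical 1d training inputs clustered near 0, with a constant bias feature.
x_train = np.array([-0.5, -0.2, 0.0, 0.1, 0.4])
X = np.column_stack([np.ones_like(x_train), x_train])
sigma2, tau2 = 0.1, 10.0
VN = np.linalg.inv(np.eye(2) / tau2 + X.T @ X / sigma2)   # Equation 7.57 with V0 = tau2 * I

var_near = predictive_variance(np.array([1.0, 0.0]), VN, sigma2)   # test point inside the data
var_far = predictive_variance(np.array([1.0, 5.0]), VN, sigma2)    # test point far from the data
plugin_var = sigma2   # the plug-in approximation ignores parameter uncertainty
```

In this setup the variance at the distant test point is much larger than at the nearby one, and both exceed the constant plug-in variance.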


Bayesian inference when $\sigma^2$ is unknown * 


In this section, we apply the results in Section 4.6.3 to the problem of computing $p(w, \sigma^2|D)$ 
for a linear regression model. This generalizes the results from Section 7.6.1 where we assumed 
$\sigma^2$ was known. In the case where we use an uninformative prior, we will see some interesting 
connections to frequentist statistics. 


Conjugate prior 
As usual, the likelihood has the form 
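
$$p(y|X, w, \sigma^2) = \mathcal{N}(y|Xw, \sigma^2 I_N) \tag{7.64}$$

By analogy to Section 4.6.3, one can show that the natural conjugate prior has the following 
form: 

$$p(w, \sigma^2) = \mathrm{NIG}(w, \sigma^2|w_0, V_0, a_0, b_0) \tag{7.65}$$
$$\triangleq \mathcal{N}(w|w_0, \sigma^2 V_0)\,\mathrm{IG}(\sigma^2|a_0, b_0) \tag{7.66}$$
$$= \frac{b_0^{a_0}}{(2\pi)^{D/2}|V_0|^{1/2}\Gamma(a_0)}(\sigma^2)^{-(a_0 + (D/2) + 1)} \tag{7.67}$$
$$\times \exp\left[-\frac{(w - w_0)^T V_0^{-1}(w - w_0) + 2b_0}{2\sigma^2}\right] \tag{7.68}$$
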
Figure 7.12 (a) Plug-in approximation to predictive density (we plug in the MLE of the parameters). (b) 
Posterior predictive density, obtained by integrating out the parameters. Black curve is posterior mean, 
error bars are 2 standard deviations of the posterior predictive density. (c) 10 samples from the plugin 
approximation to posterior predictive. (d) 10 samples from the posterior predictive. Figure generated by 
linregPostPredDemo. 


With this prior and likelihood, one can show that the posterior has the following form: 


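$$p(w, \sigma^2|D) = \mathrm{NIG}(w, \sigma^2|w_N, V_N, a_N, b_N) \tag{7.69}$$
$$w_N = V_N(V_0^{-1} w_0 + X^T y) \tag{7.70}$$
$$V_N = (V_0^{-1} + X^T X)^{-1} \tag{7.71}$$
$$a_N = a_0 + N/2 \tag{7.72}$$
$$b_N = b_0 + \frac{1}{2}\left(w_0^T V_0^{-1} w_0 + y^T y - w_N^T V_N^{-1} w_N\right) \tag{7.73}$$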


The expressions for $w_N$ and $V_N$ are similar to the case where $\sigma^2$ is known. The expression for 
$a_N$ is also intuitive, since it just updates the counts. The expression for $b_N$ can be interpreted 
as follows: it is the prior sum of squares, $b_0$, plus the empirical sum of squares, $y^T y$, plus a 
term due to the error in the prior on $w$. 
The posterior marginals are as follows: 

$$p(\sigma^2|D) = \mathrm{IG}(a_N, b_N) \tag{7.74}$$
$$p(w|D) = \mathcal{T}\left(w_N, \frac{b_N}{a_N} V_N, 2a_N\right) \tag{7.75}$$
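
As a sketch of how these updates might be computed in practice, the code below implements Equations 7.70 to 7.73 and then forms the scale matrix and degrees of freedom of the Student marginal in Equation 7.75. The data, the prior settings, and the function name `nig_posterior` are hypothetical and chosen only for illustration.

```python
import numpy as np

def nig_posterior(X, y, w0, V0, a0, b0):
    """Update NIG(w0, V0, a0, b0) to NIG(wN, VN, aN, bN), Equations 7.70-7.73."""
    N = X.shape[0]
    V0_inv = np.linalg.inv(V0)
    VN_inv = V0_inv + X.T @ X
    VN = np.linalg.inv(VN_inv)
    wN = VN @ (V0_inv @ w0 + X.T @ y)
    aN = a0 + N / 2
    bN = b0 + 0.5 * (w0 @ V0_inv @ w0 + y @ y - wN @ VN_inv @ wN)
    return wN, VN, aN, bN

# Hypothetical toy data and prior settings.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = X @ np.array([0.7, -1.2]) + rng.normal(scale=0.3, size=40)
wN, VN, aN, bN = nig_posterior(X, y, np.zeros(2), 10.0 * np.eye(2), a0=1.0, b0=1.0)

# Scale matrix and degrees of freedom of the Student marginal p(w | D), Equation 7.75.
scale_w = (bN / aN) * VN
dof = 2 * aN
```

The bracketed term in Equation 7.73 equals the minimum over $w$ of the penalized residual sum of squares, so it is non-negative and `bN` can never fall below `b0`.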


We give a worked example of using these equations in Section 7.6.3.3. 
By analogy to Section 4.6.3.6, the posterior predictive distribution is a Student T distribution. 
In particular, given $m$ new test inputs $\tilde{X}$, we have 

$$p(\tilde{y}|\tilde{X}, D) = \mathcal{T}\left(\tilde{y}\,\Big|\,\tilde{X} w_N, \frac{b_N}{a_N}(I_m + \tilde{X} V_N \tilde{X}^T), 2a_N\right) \tag{7.76}$$

The predictive variance has two components: $(b_N/a_N) I_m$ due to the measurement noise, and 
$(b_N/a_N)\tilde{X} V_N \tilde{X}^T$ due to the uncertainty in $w$. This latter term varies depending on how 
close the test inputs are to the training data. 

It is common to set $a_0 = b_0 = 0$, corresponding to an uninformative prior for $\sigma^2$, and to set 
$w_0 = 0$ and $V_0 = g(X^T X)^{-1}$ for any positive value $g$. This is called Zellner’s g-prior (Zellner 
1986). Here $g$ plays a role analogous to $1/\lambda$ in ridge regression. However, the prior covariance is 
proportional to $(X^T X)^{-1}$ rather than $I$. This ensures that the posterior is invariant to scaling 
of the inputs (Minka 2000b). See also Exercise 7.10. 

We will see below that if we use an uninformative prior, the posterior precision given $N$ 
measurements is $V_N^{-1} = X^T X$. The unit information prior is defined to contain as much 
information as one sample (Kass and Wasserman 1995). To create a unit information prior for 
linear regression, we need to use $V_0^{-1} = \frac{1}{N} X^T X$, which is equivalent to the g-prior with 
$g = N$. 


Uninformative prior 


An uninformative prior can be obtained by considering the uninformative limit of the conjugate 
g-prior, which corresponds to setting $g = \infty$. This is equivalent to an improper NIG prior with 
$w_0 = 0$, $V_0 = \infty I$, $a_0 = 0$ and $b_0 = 0$, which gives $p(w, \sigma^2) \propto \sigma^{-(D+2)}$. 

Alternatively, we can start with the semi-conjugate prior $p(w, \sigma^2) = p(w)p(\sigma^2)$, and take 
each term to its uninformative limit individually, which gives $p(w, \sigma^2) \propto \sigma^{-2}$. This is equivalent 
to an improper NIG prior with $w_0 = 0$, $V_0 = \infty I$, $a_0 = -D/2$ and $b_0 = 0$. The corresponding 
posterior is given by 

$$p(w, \sigma^2|D) = \mathrm{NIG}(w, \sigma^2|w_N, V_N, a_N, b_N) \tag{7.77}$$
$$w_N = \hat{w}_{mle} = (X^T X)^{-1} X^T y \tag{7.78}$$
$$V_N = (X^T X)^{-1} \tag{7.79}$$
$$a_N = \frac{N - D}{2} \tag{7.80}$$
$$b_N = \frac{s^2}{2} \tag{7.81}$$
$$s^2 \triangleq (y - X\hat{w}_{mle})^T (y - X\hat{w}_{mle}) \tag{7.82}$$

w_j    E[w_j|D]   sqrt(var[w_j|D])   95% CI               sig 
w0     10.998     3.06027            [4.652, 17.345]      * 
w1     -0.004     0.00156            [-0.008, -0.001]     * 
w2     -0.054     0.02190            [-0.099, -0.008]     * 
w3      0.068     0.09947            [-0.138, 0.274] 
w4     -1.294     0.56381            [-2.463, -0.124]     * 
w5      0.232     0.10438            [0.015, 0.448]       * 
w6     -0.357     1.56646            [-3.605, 2.892] 
w7     -0.237     1.00601            [-2.324, 1.849] 
w8      0.181     0.23672            [-0.310, 0.672] 
w9     -1.285     0.86485            [-3.079, 0.508] 
w10    -0.433     0.73487            [-1.957, 1.091] 


Table 7.2 Posterior mean, standard deviation and credible intervals for a linear regression model with an 
uninformative prior fit to the caterpillar data. Produced by linregBayesCaterpillar. 


The marginal distribution of the weights is given by 

$$p(w|D) = \mathcal{T}\left(w\,\Big|\,\hat{w}, \frac{s^2}{N - D} C, N - D\right) \tag{7.83}$$

where $C = (X^T X)^{-1}$ and $\hat{w}$ is the MLE. We discuss the implications of these equations below. 


An example where Bayesian and frequentist inference coincide * 


The use of a (semi-conjugate) uninformative prior is interesting because the resulting posterior 
turns out to be equivalent to the results from frequentist statistics (see also Section 4.6.3.9). In 
particular, from Equation 7.83 we have 

$$p(w_j|D) = \mathcal{T}\left(w_j\,\Big|\,\hat{w}_j, \frac{C_{jj} s^2}{N - D}, N - D\right) \tag{7.84}$$

This is equivalent to the sampling distribution of the MLE which is given by the following (see 
e.g., (Rice 1995, p542), (Casella and Berger 2002, p554)): 

$$\frac{w_j - \hat{w}_j}{s_j} \sim t_{N-D} \tag{7.85}$$

where 

$$s_j = \sqrt{\frac{s^2 C_{jj}}{N - D}} \tag{7.86}$$

is the standard error of the estimated parameter. (See Section 6.2 for a discussion of sampling 
distributions.) Consequently, the frequentist confidence interval and the Bayesian marginal 
credible interval for the parameters are the same in this case. 
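
Since both intervals have the form $\hat{w}_j \pm t \cdot (\text{scale})$ with the same Student quantile $t$ on $N - D$ degrees of freedom, it is enough to check that the two scales agree. The sketch below does this on synthetic data (an assumption; it is not the caterpillar dataset), computing the Bayesian scale from Equation 7.84 and the frequentist standard error from the usual unbiased noise estimate.

```python
import numpy as np

# Hypothetical toy data.
rng = np.random.default_rng(2)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=N)

C = np.linalg.inv(X.T @ X)
w_hat = C @ X.T @ y                      # MLE, Equation 7.78
resid = y - X @ w_hat
s2 = resid @ resid                       # residual sum of squares, Equation 7.82

# Bayesian side: scale of the Student marginal for each w_j (Equation 7.84).
bayes_scale = np.sqrt(np.diag(C) * s2 / (N - D))

# Frequentist side: standard errors built from the unbiased noise estimate.
sigma2_hat = s2 / (N - D)
std_err = np.sqrt(sigma2_hat * np.diag(C))
```

The two arrays coincide, which is the algebraic content of Equation 7.86.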

As a worked example of this, consider the caterpillar dataset from (Marin and Robert 2007). 
(The details of what the data mean don’t matter for our present purposes.) We can compute 
the posterior mean and standard deviation, and the 95% credible intervals (CI) for the regression 
coefficients using Equation 7.84. The results are shown in Table 7.2. It is easy to check that these 
95% credible intervals are identical to the 95% confidence intervals computed using standard 
frequentist methods (see linregBayesCaterpillar for the code). 

We can also use these marginal posteriors to compute if the coefficients are “significantly” 
different from 0. An informal way to do this (without using decision theory) is to check if its 95% 
CI excludes 0. From Table 7.2, we see that the CIs for coefficients 0, 1, 2, 4, 5 are all significant 
by this measure, so we put a little star by them. It is easy to check that these results are the 
same as those produced by standard frequentist software packages which compute p-values at 
the 5% level. 

Although the correspondence between the Bayesian and frequentist results might seem ap- 
pealing to some readers, recall from Section 6.6 that frequentist inference is riddled with patholo- 
gies. Also, note that the MLE does not even exist when N < D, so standard frequentist inference 
theory breaks down in this setting. Bayesian inference theory still works, although it requires 
the use of proper priors. (See (Maruyama and George 2008) for one extension of the g-prior to 
the case where D > N.) 


EB for linear regression (evidence procedure) 


So far, we have assumed the prior is known. In this section, we describe an empirical Bayes 
procedure for picking the hyper-parameters. More precisely, we choose $\eta = (\alpha, \lambda)$ to maximize 
the marginal likelihood, where $\lambda = 1/\sigma^2$ is the precision of the observation noise and $\alpha$ is 
the precision of the prior, $p(w) = \mathcal{N}(w|0, \alpha^{-1} I)$. This is known as the evidence procedure 
(MacKay 1995b).3 See Section 13.7.4 for the algorithmic details. 

The evidence procedure provides an alternative to using cross validation. For example, in 
Figure 7.13(b), we plot the log marginal likelihood for different values of $\alpha$, as well as the 
maximum value found by the optimizer. We see that, in this example, we get the same result 
as 5-CV, shown in Figure 7.13(a). (We kept $\lambda = 1/\sigma^2$ fixed in both methods, to make them 
comparable.) 
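
To make the quantity being maximized concrete, here is a minimal sketch that evaluates the log marginal likelihood on a grid of $\alpha$ values with $\lambda$ held fixed. It relies on the standard fact that, after integrating out $w$, the outputs are jointly Gaussian with zero mean and covariance $\lambda^{-1} I_N + \alpha^{-1} X X^T$. The data, the grid, and the function name `log_evidence` are hypothetical, and a plain grid search stands in for the optimizer mentioned above.

```python
import numpy as np

def log_evidence(X, y, alpha, lam):
    """log p(y | X, alpha, lam) for w ~ N(0, I/alpha) and noise variance 1/lam."""
    N = X.shape[0]
    cov = np.eye(N) / lam + X @ X.T / alpha      # marginal covariance of y
    sign, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)

# Hypothetical toy data; lam is held fixed and alpha is chosen on a grid.
rng = np.random.default_rng(3)
X = rng.normal(size=(25, 4))
y = X @ rng.normal(size=4) + rng.normal(scale=0.5, size=25)

lam = 1.0 / 0.25
alphas = np.logspace(-4, 4, 41)
scores = np.array([log_evidence(X, y, a, lam) for a in alphas])
alpha_best = alphas[np.argmax(scores)]
```

Plotting `scores` against `np.log(alphas)` would give a curve of the same kind as the one described for Figure 7.13(b).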

The principal practical advantage of the evidence procedure over CV will become apparent 
in Section 13.7, where we generalize the prior by allowing a different $\alpha_j$ for every feature. This 
can be used to perform feature selection, using a technique known as automatic relevancy 
determination or ARD. By contrast, it would not be possible to use CV to tune $D$ different 
hyper-parameters. 

The evidence procedure is also useful when comparing different kinds of models, since it 
provides a good approximation to the evidence: 


$$p(D|m) = \int\!\!\int p(D|w, m)\,p(w|m, \eta)\,p(\eta|m)\,dw\,d\eta \tag{7.87}$$

$$\approx \max_{\eta} \int p(D|w, m)\,p(w|m, \eta)\,p(\eta|m)\,dw \tag{7.88}$$


It is important to (at least approximately) integrate over $\eta$ rather than setting it arbitrarily, for 
reasons discussed in Section 5.3.2.5. Indeed, this is the method we used to evaluate the marginal 


3. Alternatively, we could integrate out $\lambda$ analytically, as shown in Section 7.6.3, and just optimize $\alpha$ (Buntine and 
Weigend 1991). However, it turns out that this is less accurate than optimizing both $\alpha$ and $\lambda$ (MacKay 1999). 

Figure 7.13 (a) Estimate of test MSE produced by 5-fold cross-validation (ntrain = 21) vs $\log(\lambda)$. The smallest value is 
indicated by the vertical line. Note the vertical scale is in log units. (b) Log marginal likelihood (log evidence) vs $\log(\alpha)$. 
The largest value is indicated by the vertical line. Figure generated by linregPolyVsRegDemo. 


likelihood for the polynomial regression models in Figures 5.7 and 5.8. For a “more Bayesian” 
approach, in which we model our uncertainty about $\eta$ rather than computing point estimates, 
see Section 21.5.2. 


Exercises 


Exercise 7.1 Behavior of training set error with increasing sample size 

The error on the test set will always decrease as we get more training data, since the model will be better 
estimated. However, as shown in Figure 7.10, for sufficiently complex models, the error on the training set 
can increase as we get more training data, until we reach some plateau. Explain why. 

Exercise 7.2 Multi-output linear regression 

(Source: Jaakkola.) 

When we have multiple independent outputs in linear regression, the model becomes 

$$p(y|x, W) = \prod_{j=1}^{M} \mathcal{N}(y_j|w_j^T x, \sigma_j^2) \tag{7.89}$$

Since the likelihood factorizes across dimensions, so does the MLE. Thus 

$$\hat{W} = [\hat{w}_1, \ldots, \hat{w}_M] \tag{7.90}$$

where $\hat{w}_j = (X^T X)^{-1} X^T Y_{:,j}$. 


In this exercise we apply this result to a model with 2 dimensional response vector $y_i \in \mathbb{R}^2$. Suppose we 
have some binary input data, $x_i \in \{0, 1\}$. The training data is as follows: 
y 
(-1, —1) 
S12)" 
=9-—1)" 


-== ocx 


Let us embed each $x_i$ into 2d using the following basis function: 

$$\phi(0) = (1, 0)^T, \quad \phi(1) = (0, 1)^T \tag{7.91}$$

The model becomes 

$$\hat{y} = W^T \phi(x) \tag{7.92}$$

where $W$ is a $2 \times 2$ matrix. Compute the MLE for $W$ from the above data. 


Exercise 7.3 Centering and ridge regression 


Assume that $\bar{x} = 0$, so the input data has been centered. Show that the optimizer of 

$$J(w, w_0) = (y - Xw - w_0 1)^T (y - Xw - w_0 1) + \lambda w^T w \tag{7.93}$$

is 

$$\hat{w}_0 = \bar{y} \tag{7.94}$$
$$\hat{w} = (X^T X + \lambda I)^{-1} X^T y \tag{7.95}$$


Exercise 7.4 MLE for $\sigma^2$ for linear regression 


Show that the MLE for the error variance in linear regression is given by 

$$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - x_i^T \hat{w})^2 \tag{7.96}$$

This is just the empirical variance of the residual errors when we plug in our estimate of $w$. 


Exercise 7.5 MLE for the offset term in linear regression 


Linear regression has the form $E[y|x] = w_0 + w^T x$. It is common to include a column of 1’s in the 
design matrix, so we can solve for the offset term $w_0$ and the other parameters $w$ at the same time 
using the normal equations. However, it is also possible to solve for $w$ and $w_0$ separately. Show that 

$$\hat{w}_0 = \frac{1}{N}\sum_i y_i - \frac{1}{N}\sum_i x_i^T w = \bar{y} - \bar{x}^T w \tag{7.97}$$

So $\hat{w}_0$ models the difference in the average output from the average predicted output. Also, show that 

$$\hat{w} = (X_c^T X_c)^{-1} X_c^T y_c = \left[\sum_{i=1}^{N}(x_i - \bar{x})(x_i - \bar{x})^T\right]^{-1}\left[\sum_{i=1}^{N}(y_i - \bar{y})(x_i - \bar{x})\right] \tag{7.98}$$

where $X_c$ is the centered input matrix containing $x_i^c = x_i - \bar{x}$ along its rows, and $y_c = y - \bar{y}$ is 
the centered output vector. Thus we can first compute $\hat{w}$ on centered data, and then estimate $w_0$ using 
$\bar{y} - \bar{x}^T \hat{w}$. 

Exercise 7.6 MLE for simple linear regression 

Simple linear regression refers to the case where the input is scalar, so $D = 1$. Show that the MLE in 
this case is given by the following equations, which may be familiar from basic statistics classes: 

$$w_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\sum_i x_i y_i - N\bar{x}\bar{y}}{\sum_i x_i^2 - N\bar{x}^2} \approx \frac{\mathrm{cov}[X, Y]}{\mathrm{var}[X]} \tag{7.99}$$
$$w_0 = \bar{y} - w_1 \bar{x} \approx E[Y] - w_1 E[X] \tag{7.100}$$

See linregDemo1 for a demo. 


Exercise 7.7 Sufficient statistics for online linear regression 


(Source: Jaakkola.) Consider fitting the model $\hat{y} = w_0 + w_1 x$ using least squares. Unfortunately we did 
not keep the original data, $x_i, y_i$, but we do have the following functions (statistics) of the data: 

$$\bar{x}^{(n)} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad \bar{y}^{(n)} = \frac{1}{n}\sum_{i=1}^{n} y_i \tag{7.101}$$
$$C_{xx}^{(n)} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2, \quad C_{xy}^{(n)} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}), \quad C_{yy}^{(n)} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2 \tag{7.102}$$

a. What are the minimal set of statistics that we need to estimate $w_1$? (Hint: see Equation 7.99.) 
b. What are the minimal set of statistics that we need to estimate $w_0$? (Hint: see Equation 7.97.) 

c. Suppose a new data point, $x_{n+1}, y_{n+1}$ arrives, and we want to update our sufficient statistics without 
looking at the old data, which we have not stored. (This is useful for online learning.) Show that we 
can do this for $\bar{x}$ as follows. 

$$\bar{x}^{(n+1)} \triangleq \frac{1}{n+1}\sum_{i=1}^{n+1} x_i = \frac{1}{n+1}\left(n\bar{x}^{(n)} + x_{n+1}\right) \tag{7.103}$$
$$= \bar{x}^{(n)} + \frac{1}{n+1}\left(x_{n+1} - \bar{x}^{(n)}\right) \tag{7.104}$$

This has the form: new estimate is old estimate plus correction. We see that the size of the correction 
diminishes over time (i.e., as we get more samples). Derive a similar expression to update $\bar{y}$. 
d. Show that one can update $C_{xy}^{(n+1)}$ recursively using 

$$C_{xy}^{(n+1)} = \frac{1}{n+1}\left[x_{n+1} y_{n+1} + n C_{xy}^{(n)} + n\bar{x}^{(n)}\bar{y}^{(n)} - (n+1)\bar{x}^{(n+1)}\bar{y}^{(n+1)}\right] \tag{7.105}$$

Derive a similar expression to update $C_{xx}$. 


e. Implement the online learning algorithm, i.e., write a function of the form [w,ss] = linregUpdateSS(ss, 
x, y), where x and y are scalars and ss is a structure containing the sufficient statistics. 


f. Plot the coefficients over “time”, using the dataset in linregDemo1. (Specifically, use [x,y] = 
polyDataMake('sampling', 'thibaux').) Check that they converge to the solution given by the 
batch (offline) learner (i.e., ordinary least squares). Your result should look like Figure 7.14. 


Turn in your derivation, code and plot. 


Figure 7.14 Regression coefficients over time. Produced by Exercise 7.7. 


Exercise 7.8 Bayesian linear regression in 1d with known $\sigma^2$ 


(Source: Bolstad.) Consider fitting a model of the form 

$$p(y|x, \theta) = \mathcal{N}(y|w_0 + w_1 x, \sigma^2) \tag{7.106}$$

to the data shown below: 


[94,96,94,95,104,106,108,113,115,121,131]; 
[0.47, 0.75, 0.83, 0.98, 1.18, 1.29, 1.40, 1.60, 1.75, 1.90, 2.23]; 


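a. Compute an unbiased estimate of $\sigma^2$ using 

$$\hat{\sigma}^2 = \frac{1}{N - 2}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 \tag{7.107}$$

(The denominator is $N - 2$ since we have 2 inputs, namely the offset term and $x$.) Here $\hat{y}_i = \hat{w}_0 + \hat{w}_1 x_i$, 
and $\hat{w} = (\hat{w}_0, \hat{w}_1)$ is the MLE. 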


b. Now assume the following prior on w: 


$$p(w) = p(w_0)\,p(w_1) \tag{7.108}$$


Use an (improper) uniform prior on $w_0$ and a $\mathcal{N}(0, 1)$ prior on $w_1$. Show that this can be written as 
a Gaussian prior of the form $p(w) = \mathcal{N}(w|w_0, V_0)$. What are $w_0$ and $V_0$? 


c. Compute the marginal posterior of the slope, $p(w_1|D, \sigma^2)$, where $D$ is the data above, and $\sigma^2$ is the 
unbiased estimate computed above. What is $E[w_1|D, \sigma^2]$ and $\mathrm{var}[w_1|D, \sigma^2]$? Show your work. (You 
can use Matlab if you like.) Hint: the posterior variance is a very small number! 


d. What is a 95% credible interval for w1? 


Exercise 7.9 Generative model for linear regression 


Linear regression is the problem of estimating $E[Y|x]$ using a linear function of the form $w_0 + w^T x$. 
Typically we assume that the conditional distribution of $Y$ given $X$ is Gaussian. We can either estimate this 
conditional Gaussian directly (a discriminative approach), or we can fit a Gaussian to the joint distribution 
of $X, Y$ and then derive $E[Y|X = x]$. 


In Exercise 7.5 we showed that the discriminative approach leads to these equations 

$$E[Y|x] = w_0 + w^T x \tag{7.109}$$
$$w_0 = \bar{y} - \bar{x}^T w \tag{7.110}$$
$$w = (X_c^T X_c)^{-1} X_c^T y_c \tag{7.111}$$

where $X_c = X - \bar{X}$ is the centered input matrix, and $\bar{X} = 1_N \bar{x}^T$ replicates $\bar{x}$ across the rows. Similarly, 
$y_c = y - \bar{y}$ is the centered output vector, and $\bar{y} = 1_N \bar{y}$ replicates $\bar{y}$ across the rows. 


a. By finding the maximum likelihood estimates of $\Sigma_{XX}$, $\Sigma_{XY}$, $\mu_X$ and $\mu_Y$, derive the above equations 
by fitting a joint Gaussian to $X, Y$ and using the formula for conditioning a Gaussian (see Section 4.3.1). 
Show your work. 


b. What are the advantages and disadvantages of this approach compared to the standard discriminative 
approach? 


Exercise 7.10 Bayesian linear regression using the g-prior 


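Show that when we use the g-prior, $p(w, \sigma^2) = \mathrm{NIG}(w, \sigma^2|0, g(X^T X)^{-1}, 0, 0)$, the posterior has the 
following form: 

$$p(w, \sigma^2|D) = \mathrm{NIG}(w, \sigma^2|w_N, V_N, a_N, b_N) \tag{7.112}$$
$$V_N = \frac{g}{g + 1}(X^T X)^{-1} \tag{7.113}$$
$$w_N = \frac{g}{g + 1}\hat{w}_{mle} \tag{7.114}$$
$$a_N = N/2 \tag{7.115}$$
$$b_N = \frac{s^2}{2} + \frac{1}{2(g + 1)}\hat{w}_{mle}^T X^T X \hat{w}_{mle} \tag{7.116}$$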




Logistic regression 


Introduction 


One way to build a probabilistic classifier is to create a joint model of the form p(y,x) and 
then to condition on x, thereby deriving p(y|x). This is called the generative approach. An 
alternative approach is to fit a model of the form p(y|x) directly. This is called the discrimi- 
native approach, and is the approach we adopt in this chapter. In particular, we will assume 
discriminative models which are linear in the parameters. This will turn out to significantly sim- 
plify model fitting, as we will see. In Section 8.6, we compare the generative and discriminative 
approaches, and in later chapters, we will consider non-linear and non-parametric discriminative 
models. 


Model specification 


As we discussed in Section 1.4.6, logistic regression corresponds to the following binary classification 
model: 

$$p(y|x, w) = \mathrm{Ber}(y|\mathrm{sigm}(w^T x)) \tag{8.1}$$

A 1d example is shown in Figure 1.19(b). Logistic regression can easily be extended to higher-dimensional 
inputs. For example, Figure 8.1 shows plots of $p(y = 1|x, w) = \mathrm{sigm}(w^T x)$ for 
2d input and different weight vectors $w$. If we threshold these probabilities at 0.5, we induce a 
linear decision boundary, whose normal (perpendicular) is given by $w$. 
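
The link between thresholding at 0.5 and a linear boundary can be illustrated with a few lines of code. This is only a sketch: the weight vector and the three input points are hypothetical, chosen so that one point falls on each side of the boundary and one lies exactly on it.

```python
import numpy as np

def sigm(a):
    """Logistic sigmoid, sigm(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([2.0, -1.0])                 # hypothetical weight vector
X = np.array([[1.0, 0.0],                 # w^T x = 2, so p > 0.5
              [0.0, 3.0],                 # w^T x = -3, so p < 0.5
              [1.0, 2.0]])                # w^T x = 0, exactly on the boundary

p = sigm(X @ w)                           # p(y = 1 | x, w) for each row
y_hat = (p > 0.5).astype(int)             # thresholded class labels
```

Because the sigmoid is monotonic and equals 0.5 at zero, the test `p > 0.5` is the same as `X @ w > 0`, which is a half-space whose normal is `w`.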


Model fitting 


In this section, we discuss algorithms for estimating the parameters of a logistic regression 
model. 
Figure 8.1 Plots of $\mathrm{sigm}(w_1 x_1 + w_2 x_2)$. Here $w = (w_1, w_2)$ defines the normal to the decision 
boundary. Points to the right of this have $\mathrm{sigm}(w^T x) > 0.5$, and points to the left have $\mathrm{sigm}(w^T x) < 
0.5$. Based on Figure 39.3 of (MacKay 2003). Figure generated by sigmoidplot2D. 


MLE 


The negative log-likelihood for logistic regression is given by 

$$\mathrm{NLL}(w) = -\sum_{i=1}^{N} \log\left[\mu_i^{\mathbb{I}(y_i = 1)} \times (1 - \mu_i)^{\mathbb{I}(y_i = 0)}\right] \tag{8.2}$$
$$= -\sum_{i=1}^{N}\left[y_i \log \mu_i + (1 - y_i)\log(1 - \mu_i)\right] \tag{8.3}$$

This is also called the cross-entropy error function (see Section 2.8.2). 
Another way of writing this is as follows. Suppose $\tilde{y}_i \in \{-1, +1\}$ instead of $y_i \in \{0, 1\}$. We 
have $p(y = 1) = \frac{1}{1 + \exp(-w^T x)}$ and $p(y = -1) = \frac{1}{1 + \exp(+w^T x)}$. Hence 

$$\mathrm{NLL}(w) = \sum_{i=1}^{N} \log(1 + \exp(-\tilde{y}_i w^T x_i)) \tag{8.4}$$

Unlike linear regression, we can no longer write down the MLE in closed form. Instead, we 
need to use an optimization algorithm to compute it. For this, we need to derive the gradient 
and Hessian. 
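
The equivalence of the two forms of the NLL is easy to confirm numerically. The sketch below uses a small hypothetical dataset and weight vector; `np.logaddexp(0, a)` computes $\log(1 + \exp(a))$ in a numerically stable way.

```python
import numpy as np

def nll_01(w, X, y):
    """Equation 8.3 with labels y_i in {0, 1}."""
    mu = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def nll_pm1(w, X, y_tilde):
    """Equation 8.4 with labels in {-1, +1}; logaddexp(0, a) = log(1 + exp(a))."""
    return np.sum(np.logaddexp(0.0, -y_tilde * (X @ w)))

# Hypothetical toy inputs; the first column is a constant bias feature.
X = np.array([[1.0, -2.0], [1.0, -0.5], [1.0, 0.3], [1.0, 1.7]])
y = np.array([0, 1, 0, 1])
w = np.array([0.2, 0.8])

a = nll_01(w, X, y)
b = nll_pm1(w, X, 2 * y - 1)      # map {0, 1} labels to {-1, +1}
```

The two values agree, and at $w = 0$ both reduce to $N \log 2$, since every case is then assigned probability 0.5.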

In the case of logistic regression, one can show (Exercise 8.3) that the gradient and Hessian 
of this are given by the following 

$$g = \frac{d}{dw} f(w) = \sum_i (\mu_i - y_i) x_i = X^T(\mu - y) \tag{8.5}$$
$$H = \frac{d}{dw} g(w)^T = \sum_i (\nabla_w \mu_i) x_i^T = \sum_i \mu_i(1 - \mu_i) x_i x_i^T \tag{8.6}$$
$$= X^T S X \tag{8.7}$$

where $S \triangleq \mathrm{diag}(\mu_i(1 - \mu_i))$. One can also show (Exercise 8.3) that $H$ is positive definite. 
Hence the NLL is convex and has a unique global minimum. Below we discuss some methods 
for finding this minimum. 

Figure 8.2 Gradient descent on a simple function, starting from (0,0), for 20 steps, using a fixed 
learning rate (step size) $\eta$. The global minimum is at (1,1). (a) $\eta = 0.1$. (b) $\eta = 0.6$. Figure generated by 
steepestDescentDemo. 
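
A quick way to gain confidence in a gradient and Hessian of this kind is to compare them with finite differences. The sketch below assumes the standard forms $g = X^T(\mu - y)$ and $H = X^T S X$ with $S = \mathrm{diag}(\mu_i(1 - \mu_i))$, and uses a small hypothetical dataset whose design matrix has full column rank (which is what makes $H$ strictly positive definite here).

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(w, X, y):
    mu = sigm(X @ w)
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def grad(w, X, y):
    return X.T @ (sigm(X @ w) - y)                 # g = X^T (mu - y)

def hess(w, X):
    mu = sigm(X @ w)
    return X.T @ (X * (mu * (1 - mu))[:, None])    # H = X^T S X

# Hypothetical toy inputs; the first column is a constant bias feature.
X = np.array([[1.0, -2.0], [1.0, -0.5], [1.0, 0.3], [1.0, 1.7]])
y = np.array([0, 1, 0, 1])
w = np.array([0.2, 0.8])

# Central finite differences as an independent check of the gradient.
eps = 1e-6
g_fd = np.array([(nll(w + eps * e, X, y) - nll(w - eps * e, X, y)) / (2 * eps)
                 for e in np.eye(2)])
H = hess(w, X)
```

The analytic and numerical gradients agree closely, and the eigenvalues of `H` are all positive for this data.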


Steepest descent 


Perhaps the simplest algorithm for unconstrained optimization is gradient descent, also known 
as steepest descent. This can be written as follows: 


$$\theta_{k+1} = \theta_k - \eta_k g_k \tag{8.8}$$


where $\eta_k$ is the step size or learning rate. The main issue in gradient descent is: how should 
we set the step size? This turns out to be quite tricky. If we use a constant learning rate, but 
make it too small, convergence will be very slow, but if we make it too large, the method can fail 
to converge at all. This is illustrated in Figure 8.2, where we plot the following (convex) function 

$$f(\theta) = 0.5(\theta_1^2 - \theta_2)^2 + 0.5(\theta_1 - 1)^2 \tag{8.9}$$

We arbitrarily decide to start from (0,0). In Figure 8.2(a), we use a fixed step size of $\eta = 0.1$; we 
see that it moves slowly along the valley. In Figure 8.2(b), we use a fixed step size of $\eta = 0.6$; we 
see that the algorithm starts oscillating up and down the sides of the valley and never converges 
to the optimum. 
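
The experiment just described is straightforward to reproduce in outline. The following sketch hand-codes the gradient of Equation 8.9 and runs the update in Equation 8.8 for 20 steps from (0,0) with each of the two step sizes; it is an illustrative reimplementation, not the steepestDescentDemo script itself.

```python
import numpy as np

def f(theta):
    t1, t2 = theta
    return 0.5 * (t1**2 - t2)**2 + 0.5 * (t1 - 1)**2     # Equation 8.9

def grad_f(theta):
    t1, t2 = theta
    return np.array([2 * t1 * (t1**2 - t2) + (t1 - 1), -(t1**2 - t2)])

def gradient_descent(eta, n_steps=20):
    """Run fixed-step gradient descent from (0, 0) and return the visited points."""
    theta = np.zeros(2)
    path = [theta]
    for _ in range(n_steps):
        theta = theta - eta * grad_f(theta)              # Equation 8.8
        path.append(theta)
    return np.array(path)

path_small = gradient_descent(eta=0.1)
path_large = gradient_descent(eta=0.6)
```

Plotting the rows of `path_small` and `path_large` over the contours of `f` is one way to compare the two trajectories with the panels of Figure 8.2.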

Figure 8.3 (a) Steepest descent on the same function as Figure 8.2, starting from (0,0), using exact line search. 
Figure generated by steepestDescentDemo. (b) Illustration of the fact that at the end of a line search 
(top of picture), the local gradient of the function will be perpendicular to the search direction. Based on 
Figure 10.6.1 of (Press et al. 1988). 


Let us develop a more stable method for picking the step size, so that the method is guaranteed 
to converge to a local optimum no matter where we start. (This property is called global 
convergence, which should not be confused with convergence to the global optimum!) By 
Taylor’s theorem, we have 

$$f(\theta + \eta d) \approx f(\theta) + \eta g^T d \tag{8.10}$$

where $d$ is our descent direction. So if $\eta$ is chosen small enough, then $f(\theta + \eta d) < f(\theta)$, since 
the directional derivative $g^T d$ is negative for a descent direction. But we don’t want to choose the step size $\eta$ too small, or we will 
move very slowly and may not reach the minimum. So let us pick $\eta$ to minimize 

$$\phi(\eta) = f(\theta_k + \eta d_k) \tag{8.11}$$

This is called line minimization or line search. There are various methods for solving this 1d 
optimization problem; see (Nocedal and Wright 2006) for details. 

Figure 8.3(a) demonstrates that line search does indeed work for our simple problem. However, 
we see that the steepest descent path with exact line searches exhibits a characteristic zig-zag 
behavior. To see why, note that an exact line search satisfies $\eta_k = \arg\min_{\eta > 0} \phi(\eta)$. A 
necessary condition for the optimum is $\phi'(\eta) = 0$. By the chain rule, $\phi'(\eta) = d^T g$, where 
$g = f'(\theta + \eta d)$ is the gradient at the end of the step. So we either have $g = 0$, which means 
we have found a stationary point, or $g \perp d$, which means that exact search stops at a point 
where the local gradient is perpendicular to the search direction. Hence consecutive directions 
will be orthogonal (see Figure 8.3(b)). This explains the zig-zag behavior. 
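
The orthogonality of consecutive directions can be verified on a case where the exact line search has a closed form. The sketch below assumes a hypothetical strictly convex quadratic $f(\theta) = \frac{1}{2}\theta^T A \theta - b^T \theta$ (not the function from Figure 8.2). For this $f$, setting $\phi'(\eta) = 0$ along $d = -g$ gives $\eta = g^T g / g^T A g$.

```python
import numpy as np

# Hypothetical strictly convex quadratic f(theta) = 0.5 theta^T A theta - b^T theta.
A = np.array([[10.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

def grad(theta):
    return A @ theta - b

theta = np.zeros(2)
grads = []
for _ in range(6):
    g = grad(theta)
    grads.append(g)
    d = -g                                  # steepest descent direction
    eta = (g @ g) / (g @ A @ g)             # exact minimizer of phi(eta) for a quadratic
    theta = theta + eta * d

# Inner products of consecutive gradients; each should vanish up to round-off.
dots = [grads[k] @ grads[k + 1] for k in range(len(grads) - 1)]
```

Every entry of `dots` is zero to machine precision, so each step turns through a right angle, which is the zig-zag pattern described above.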

One simple heuristic to reduce the effect of zig-zagging is to add a momentum term, $(\theta_k - 
\theta_{k-1})$, as follows: 

$$\theta_{k+1} = \theta_k - \eta_k g_k + \mu_k(\theta_k - \theta_{k-1}) \tag{8.12}$$

where $0 \le \mu_k \le 1$ controls the importance of the momentum term. In the optimization 
community, this is known as the heavy ball method (see e.g., (Bertsekas 1999)). 

An alternative way to minimize “zig-zagging” is to use the method of conjugate gradients 
(see e.g., (Nocedal and Wright 2006, ch 5) or (Golub and van Loan 1996, Sec 10.2)). This is the 
method of choice for quadratic objectives of the form $f(\theta) = \theta^T A \theta$, which arise when solving 
linear systems. However, non-linear CG is less popular. 


Newton’s method 


Algorithm 8.1: Newton’s method for minimizing a strictly convex function 

Initialize $\theta_0$; 
for $k = 1, 2, \ldots$ until convergence do 
    Evaluate $g_k = \nabla f(\theta_k)$; 
    Evaluate $H_k = \nabla^2 f(\theta_k)$; 
    Solve $H_k d_k = -g_k$ for $d_k$; 
    Use line search to find stepsize $\eta_k$ along $d_k$; 
    $\theta_{k+1} = \theta_k + \eta_k d_k$; 



One can derive faster optimization methods by taking the curvature of the space (i.e., the 
Hessian) into account. These are called second order optimization methods. The primary 
example is Newton’s algorithm. This is an iterative algorithm which consists of updates of the 
form 

$$\theta_{k+1} = \theta_k - \eta_k H_k^{-1} g_k \tag{8.13}$$

The full pseudo-code is given in Algorithm 8.1. 
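This algorithm can be derived as follows. Consider making a second-order Taylor series 
approximation of $f(\theta)$ around $\theta_k$: 

$$f_{quad}(\theta) = f_k + g_k^T(\theta - \theta_k) + \frac{1}{2}(\theta - \theta_k)^T H_k(\theta - \theta_k) \tag{8.14}$$

Let us rewrite this as 

$$f_{quad}(\theta) = \theta^T A \theta + b^T \theta + c \tag{8.15}$$

where 

$$A = \frac{1}{2} H_k, \quad b = g_k - H_k \theta_k, \quad c = f_k - g_k^T \theta_k + \frac{1}{2}\theta_k^T H_k \theta_k \tag{8.16}$$

The minimum of $f_{quad}$ is at 

$$\theta = -\frac{1}{2} A^{-1} b = \theta_k - H_k^{-1} g_k \tag{8.17}$$

Thus the Newton step $d_k = -H_k^{-1} g_k$ is what should be added to $\theta_k$ to minimize the second 
order approximation of $f$ around $\theta_k$. See Figure 8.4(a) for an illustration. 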
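
The steps of Algorithm 8.1 can be sketched as follows. The objective here is a hypothetical strictly convex function, $f(\theta) = \sum_i (e^{\theta_i} - \theta_i) + \frac{1}{2}\theta^T Q \theta$, whose minimizer is $\theta = 0$; it is chosen only so that the result is easy to check. A backtracking (Armijo) line search is used in place of an exact line search, which is a common substitute but an assumption of this sketch.

```python
import numpy as np

Q = np.array([[2.0, 1.0], [1.0, 2.0]])    # hypothetical positive definite coupling

def f(theta):
    return np.sum(np.exp(theta) - theta) + 0.5 * theta @ Q @ theta

def grad(theta):
    return np.exp(theta) - 1.0 + Q @ theta

def hess(theta):
    return np.diag(np.exp(theta)) + Q

def newton(theta, tol=1e-6, max_iter=50):
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(theta), -g)        # solve H_k d_k = -g_k
        eta = 1.0
        # Backtracking (Armijo) line search, used here in place of an exact search.
        while f(theta + eta * d) > f(theta) + 1e-4 * eta * (g @ d):
            eta *= 0.5
        theta = theta + eta * d
    return theta

theta_star = newton(np.array([2.0, -3.0]))
```

Solving the linear system with `np.linalg.solve` avoids forming $H_k^{-1}$ explicitly, which is usually both cheaper and more accurate.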

Figure 8.4 Illustration of Newton’s method for minimizing a 1d function. (a) The solid curve is the 
function $f(x)$. The dotted line $f_{quad}(x)$ is its second order approximation at $x_k$. The Newton step $d_k$ 
is what must be added to $x_k$ to get to the minimum of $f_{quad}(x)$. Based on Figure 13.4 of (Vandenberghe 
2006). Figure generated by newtonsMethodMinQuad. (b) Illustration of Newton’s method applied to a 
nonconvex function. We fit a quadratic around the current point $x_k$ and move to its stationary point, 
$x_{k+1} = x_k + d_k$. Unfortunately, this is a local maximum, not minimum. This means we need to be careful 
about the extent of our quadratic approximation. Based on Figure 13.11 of (Vandenberghe 2006). Figure 
generated by newtonsMethodNonConvex. 


In its simplest form (as listed), Newton’s method requires that $H_k$ be positive definite, which 
will hold if the function is strictly convex. If the objective function is not convex, then 
$H_k$ may not be positive definite, so $d_k = -H_k^{-1} g_k$ may not be a descent direction (see 
Figure 8.4(b) for an example). In this case, one simple strategy is to revert to steepest descent, 
$d_k = -g_k$. The Levenberg Marquardt algorithm is an adaptive way to blend between Newton 
steps and steepest descent steps. This method is widely used when solving nonlinear least 
squares problems. An alternative approach is this: Rather than computing $d_k = -H_k^{-1} g_k$ 
directly, we can solve the linear system of equations $H_k d_k = -g_k$ for $d_k$ using conjugate 
gradient (CG). If $H_k$ is not positive definite, we can simply truncate the CG iterations as soon 
as negative curvature is detected; this is called truncated Newton. 


Iteratively reweighted least squares (IRLS) 


Let us now apply Newton's algorithm to find the MLE for binary logistic regression. The Newton
update at iteration $k + 1$ for this model is as follows (using $\eta_k = 1$, since the Hessian is exact):


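$$\begin{aligned}
\mathbf{w}_{k+1} &= \mathbf{w}_k - \mathbf{H}^{-1}\mathbf{g}_k && (8.18) \\
&= \mathbf{w}_k + (\mathbf{X}^T\mathbf{S}_k\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y} - \boldsymbol{\mu}_k) && (8.19) \\
&= (\mathbf{X}^T\mathbf{S}_k\mathbf{X})^{-1}\left[(\mathbf{X}^T\mathbf{S}_k\mathbf{X})\mathbf{w}_k + \mathbf{X}^T(\mathbf{y} - \boldsymbol{\mu}_k)\right] && (8.20) \\
&= (\mathbf{X}^T\mathbf{S}_k\mathbf{X})^{-1}\mathbf{X}^T\left[\mathbf{S}_k\mathbf{X}\mathbf{w}_k + \mathbf{y} - \boldsymbol{\mu}_k\right] && (8.21) \\
&= (\mathbf{X}^T\mathbf{S}_k\mathbf{X})^{-1}\mathbf{X}^T\mathbf{S}_k\mathbf{z}_k && (8.22)
\end{aligned}$$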


where we have defined the working response as 


$$\mathbf{z}_k \triangleq \mathbf{X}\mathbf{w}_k + \mathbf{S}_k^{-1}(\mathbf{y} - \boldsymbol{\mu}_k) \tag{8.23}$$

Equation 8.22 is the solution to a weighted least squares problem: it is the minimizer of

$$\sum_{i=1}^N S_{ki}(z_{ki} - \mathbf{w}^T\mathbf{x}_i)^2 \tag{8.24}$$

Since $\mathbf{S}_k$ is a diagonal matrix, we can rewrite the targets in component form (for each case
$i = 1:N$) as

$$z_{ki} = \mathbf{w}_k^T\mathbf{x}_i + \frac{y_i - \mu_{ki}}{\mu_{ki}(1 - \mu_{ki})} \tag{8.25}$$

This algorithm is known as iteratively reweighted least squares or IRLS for short, since at
each iteration, we solve a weighted least squares problem, where the weight matrix $\mathbf{S}_k$ changes
at each iteration. See Algorithm 8.2 for some pseudocode.


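Algorithm 8.2: Iteratively reweighted least squares (IRLS)

    $\mathbf{w} = \mathbf{0}_D$;
    $w_0 = \log(\bar{y}/(1 - \bar{y}))$;
    repeat
        $\eta_i = w_0 + \mathbf{w}^T\mathbf{x}_i$;
        $\mu_i = \mathrm{sigm}(\eta_i)$;
        $s_i = \mu_i(1 - \mu_i)$;
        $z_i = \eta_i + \frac{y_i - \mu_i}{s_i}$;
        $\mathbf{S} = \mathrm{diag}(s_{1:N})$;
        $\mathbf{w} = (\mathbf{X}^T\mathbf{S}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{S}\mathbf{z}$;
    until converged;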
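
As a rough illustration of the IRLS loop, the following sketch handles only a scalar input plus an offset, so the weighted least squares solve reduces to a 2x2 linear system. The function name `irls_1d`, the data, and the convergence test are assumptions made for this example; unlike the pseudocode, the offset is updated jointly with the slope in each solve, and no safeguard is included for $\mu_i$ reaching 0 or 1.

```python
import math


def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))


def irls_1d(xs, ys, num_iters=25, tol=1e-10):
    """IRLS for p(y=1|x) = sigm(w0 + w1 * x) with scalar inputs x."""
    ybar = sum(ys) / len(ys)
    w0, w1 = math.log(ybar / (1.0 - ybar)), 0.0
    for _ in range(num_iters):
        # Accumulate the entries of X^T S X (a00, a01, a11) and X^T S z (b0, b1).
        a00 = a01 = a11 = b0 = b1 = 0.0
        for x, y in zip(xs, ys):
            eta = w0 + w1 * x
            mu = sigm(eta)
            s = mu * (1.0 - mu)           # weight for this case
            z = eta + (y - mu) / s        # working response
            a00 += s
            a01 += s * x
            a11 += s * x * x
            b0 += s * z
            b1 += s * x * z
        det = a00 * a11 - a01 * a01       # solve the 2x2 system by Cramer's rule
        new_w0 = (a11 * b0 - a01 * b1) / det
        new_w1 = (a00 * b1 - a01 * b0) / det
        done = abs(new_w0 - w0) + abs(new_w1 - w1) < tol
        w0, w1 = new_w0, new_w1
        if done:
            break
    return w0, w1


# Illustrative, non-separable data so that the MLE is finite.
xs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 0, 1]
w0, w1 = irls_1d(xs, ys)
```

At the returned weights the gradient $\sum_i (\mu_i - y_i)[1, x_i]$ should be numerically zero, which is a convenient check that the iteration has reached the MLE.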


Quasi-Newton (variable metric) methods 


The mother of all second-order optimization algorithms is Newton's algorithm, which we
discussed in Section 8.3.3. Unfortunately, it may be too expensive to compute $\mathbf{H}$ explicitly.
Quasi-Newton methods iteratively build up an approximation to the Hessian using information gleaned
from the gradient vector at each step. The most common method is called BFGS (named after
its inventors, Broyden, Fletcher, Goldfarb and Shanno), which updates the approximation to the
Hessian $\mathbf{B}_k \approx \mathbf{H}_k$ as follows:


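$$\begin{aligned}
\mathbf{B}_{k+1} &= \mathbf{B}_k + \frac{\mathbf{y}_k\mathbf{y}_k^T}{\mathbf{y}_k^T\mathbf{s}_k} - \frac{(\mathbf{B}_k\mathbf{s}_k)(\mathbf{B}_k\mathbf{s}_k)^T}{\mathbf{s}_k^T\mathbf{B}_k\mathbf{s}_k} && (8.26) \\
\mathbf{s}_k &= \boldsymbol{\theta}_k - \boldsymbol{\theta}_{k-1} && (8.27) \\
\mathbf{y}_k &= \mathbf{g}_k - \mathbf{g}_{k-1} && (8.28)
\end{aligned}$$

This is a rank-two update to the matrix, and ensures that the matrix remains positive definite
(under certain restrictions on the step size). We typically start with a diagonal approximation,
$\mathbf{B}_0 = \mathbf{I}$. Thus BFGS can be thought of as a “diagonal plus low-rank” approximation to the
Hessian.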




Alternatively, BFGS can iteratively update an approximation to the inverse Hessian, $\mathbf{C}_k \approx \mathbf{H}_k^{-1}$,
as follows:

$$\mathbf{C}_{k+1} = \left(\mathbf{I} - \frac{\mathbf{s}_k\mathbf{y}_k^T}{\mathbf{y}_k^T\mathbf{s}_k}\right)\mathbf{C}_k\left(\mathbf{I} - \frac{\mathbf{y}_k\mathbf{s}_k^T}{\mathbf{y}_k^T\mathbf{s}_k}\right) + \frac{\mathbf{s}_k\mathbf{s}_k^T}{\mathbf{y}_k^T\mathbf{s}_k} \tag{8.29}$$

Since storing the Hessian takes $O(D^2)$ space, for very large problems, one can use limited
memory BFGS, or L-BFGS, where $\mathbf{H}_k$ or $\mathbf{H}_k^{-1}$ is approximated by a diagonal plus low rank
matrix. In particular, the product $\mathbf{H}_k^{-1}\mathbf{g}_k$ can be obtained by performing a sequence of inner
products with $\mathbf{s}_k$ and $\mathbf{y}_k$, using only the $m$ most recent $(\mathbf{s}_k, \mathbf{y}_k)$ pairs, and ignoring older
information. The storage requirements are therefore $O(mD)$. Typically $m \sim 20$ suffices for
good performance. See (Nocedal and Wright 2006, p177) for more information. L-BFGS is
often the method of choice for most unconstrained smooth optimization problems that arise in
machine learning (although see Section 8.5).
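
To make the inverse-Hessian update of Equation 8.29 concrete, here is a small dense sketch using plain lists. The function name and the numbers in the usage lines are hypothetical; a practical implementation would use a linear algebra library, and L-BFGS would avoid forming the matrix at all.

```python
def bfgs_inverse_update(C, s, y):
    """One BFGS update of an inverse-Hessian approximation C.

    C is a square matrix given as a list of rows; s is the change in the
    parameters and y the change in the gradient (lists of equal length).
    """
    n = len(s)
    rho = 1.0 / sum(yi * si for yi, si in zip(y, s))   # 1 / (y^T s)
    # A = I - rho * s y^T, so that C_new = A C A^T + rho * s s^T.
    A = [[(1.0 if i == j else 0.0) - rho * s[i] * y[j] for j in range(n)]
         for i in range(n)]
    AC = [[sum(A[i][k] * C[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    return [[sum(AC[i][k] * A[j][k] for k in range(n)) + rho * s[i] * s[j]
             for j in range(n)]
            for i in range(n)]


C0 = [[1.0, 0.0], [0.0, 1.0]]      # start from the identity
s = [1.0, 0.5]                     # illustrative parameter change
y = [2.0, 0.25]                    # illustrative gradient change (y^T s > 0)
C1 = bfgs_inverse_update(C0, s, y)
```

A useful sanity check is the secant condition: the updated matrix maps the gradient change back to the parameter change, $\mathbf{C}_{k+1}\mathbf{y}_k = \mathbf{s}_k$.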


ℓ2 regularization


Just as we prefer ridge regression to linear regression, so we should prefer MAP estimation for
logistic regression to computing the MLE. In fact, regularization is important in the classification
setting even if we have lots of data. To see why, suppose the data is linearly separable. In
this case, the MLE is obtained when $||\mathbf{w}|| \rightarrow \infty$, corresponding to an infinitely steep sigmoid
function, $\mathbb{I}(\mathbf{w}^T\mathbf{x} > w_0)$, also known as a linear threshold unit. This assigns the maximal
amount of probability mass to the training data. However, such a solution is very brittle and
will not generalize well.

To prevent this, we can use $\ell_2$ regularization, just as we did with ridge regression. We note
that the new objective, gradient and Hessian have the following forms (the factor of 2 comes from
differentiating $\lambda\mathbf{w}^T\mathbf{w}$, and can equivalently be absorbed into $\lambda$):

$$\begin{aligned}
f'(\mathbf{w}) &= \mathrm{NLL}(\mathbf{w}) + \lambda\mathbf{w}^T\mathbf{w} && (8.30) \\
\mathbf{g}'(\mathbf{w}) &= \mathbf{g}(\mathbf{w}) + 2\lambda\mathbf{w} && (8.31) \\
\mathbf{H}'(\mathbf{w}) &= \mathbf{H}(\mathbf{w}) + 2\lambda\mathbf{I} && (8.32)
\end{aligned}$$


It is a simple matter to pass these modified equations into any gradient-based optimizer. 
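
As a hedged sketch of what "passing the modified equations to an optimizer" might look like, the functions below evaluate the penalized objective and its gradient for binary labels in {0, 1}. The function names and the tiny data set are illustrative assumptions. Differentiating the penalty $\lambda\mathbf{w}^T\mathbf{w}$ contributes $2\lambda\mathbf{w}$ to the gradient, which the code uses explicitly.

```python
import math


def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))


def penalized_nll(w, xs, ys, lam):
    """NLL(w) + lam * w^T w for logistic regression with y in {0, 1}."""
    nll = 0.0
    for x, y in zip(xs, ys):
        mu = sigm(sum(wj * xj for wj, xj in zip(w, x)))
        nll -= y * math.log(mu) + (1 - y) * math.log(1.0 - mu)
    return nll + lam * sum(wj * wj for wj in w)


def penalized_grad(w, xs, ys, lam):
    """Gradient of penalized_nll: sum_i (mu_i - y_i) x_i + 2 * lam * w."""
    g = [2.0 * lam * wj for wj in w]
    for x, y in zip(xs, ys):
        mu = sigm(sum(wj * xj for wj, xj in zip(w, x)))
        for j, xj in enumerate(x):
            g[j] += (mu - y) * xj
    return g


# Illustrative inputs: a constant feature plus one real-valued feature.
xs = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
ys = [1, 0, 1]
w = [0.3, -0.2]
grad = penalized_grad(w, xs, ys, lam=0.1)
```

Comparing `grad` against finite differences of `penalized_nll` is a cheap way to catch sign or scaling mistakes before handing the pair to an optimizer.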


Multi-class logistic regression 


Now we consider multinomial logistic regression, sometimes called a maximum entropy
classifier. This is a model of the form

$$p(y = c|\mathbf{x}, \mathbf{W}) = \frac{\exp(\mathbf{w}_c^T\mathbf{x})}{\sum_{c'=1}^C \exp(\mathbf{w}_{c'}^T\mathbf{x})} \tag{8.33}$$

A slight variant, known as a conditional logit model, normalizes over a different set of classes
for each data case; this can be useful for modeling choices that users make between different
sets of items that are offered to them.

Let us now introduce some notation. Let $\mu_{ic} = p(y_i = c|\mathbf{x}_i, \mathbf{W}) = \mathcal{S}(\boldsymbol{\eta}_i)_c$, where $\boldsymbol{\eta}_i =
\mathbf{W}^T\mathbf{x}_i$ is a $C \times 1$ vector. Also, let $y_{ic} = \mathbb{I}(y_i = c)$ be the one-of-$C$ encoding of $y_i$; thus $\mathbf{y}_i$ is a
bit vector, in which the $c$’th bit turns on iff $y_i = c$. Following (Krishnapuram et al. 2005), let us
set $\mathbf{w}_C = \mathbf{0}$, to ensure identifiability, and define $\mathbf{w} = \mathrm{vec}(\mathbf{W}(:, 1:C-1))$ to be a
column vector of length $D(C-1)$.
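
The class probabilities of Equation 8.33 can be computed as in the sketch below. The function names are hypothetical, and subtracting the maximum score before exponentiating is a standard numerical safeguard that is not part of the text; it leaves the probabilities unchanged.

```python
import math


def softmax(eta):
    """Map a list of scores to probabilities that sum to one.

    The maximum score is subtracted first so that exp() cannot overflow.
    """
    m = max(eta)
    exps = [math.exp(e - m) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]


def class_probs(W, x):
    """p(y = c | x, W), where W is a list of C weight vectors w_c."""
    return softmax([sum(wj * xj for wj, xj in zip(w_c, x)) for w_c in W])


# Three classes, two features; the last weight vector is clamped to zero,
# mirroring the identifiability constraint w_C = 0.
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
p = class_probs(W, [2.0, 1.0])
```

Here the scores are 2, 1 and 0, so the probabilities are ordered accordingly and sum to one.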
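With this, the log-likelihood can be written as

$$\begin{aligned}
\ell(\mathbf{W}) &= \log\prod_{i=1}^N\prod_{c=1}^C \mu_{ic}^{y_{ic}} = \sum_{i=1}^N\sum_{c=1}^C y_{ic}\log\mu_{ic} && (8.34) \\
&= \sum_{i=1}^N\left[\left(\sum_{c=1}^C y_{ic}\mathbf{w}_c^T\mathbf{x}_i\right) - \log\left(\sum_{c'=1}^C \exp(\mathbf{w}_{c'}^T\mathbf{x}_i)\right)\right] && (8.35)
\end{aligned}$$

Define the NLL as

$$f(\mathbf{w}) = -\ell(\mathbf{w}) \tag{8.36}$$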


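We now proceed to compute the gradient and Hessian of this expression. Since $\mathbf{w}$ is
block-structured, the notation gets a bit heavy, but the ideas are simple. It helps to define $\mathbf{A} \otimes \mathbf{B}$
to be the Kronecker product of matrices $\mathbf{A}$ and $\mathbf{B}$. If $\mathbf{A}$ is an $m \times n$ matrix and $\mathbf{B}$ is a $p \times q$
matrix, then $\mathbf{A} \otimes \mathbf{B}$ is the $mp \times nq$ block matrix

$$\mathbf{A} \otimes \mathbf{B} = \begin{pmatrix} a_{11}\mathbf{B} & \cdots & a_{1n}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{m1}\mathbf{B} & \cdots & a_{mn}\mathbf{B} \end{pmatrix} \tag{8.37}$$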
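
For readers who prefer code to block-matrix notation, the Kronecker product can be sketched in a few lines. The function name `kron` and the list-of-rows representation are choices made for this illustration.

```python
def kron(A, B):
    """Kronecker product of matrices A (m x n) and B (p x q), given as lists of rows.

    Entry (i, j) of the result is A[i // p][j // q] * B[i % p][j % q],
    which places a copy of B scaled by each entry of A in the matching block.
    """
    p, q = len(B), len(B[0])
    return [[A[i // p][j // q] * B[i % p][j % q]
             for j in range(len(A[0]) * q)]
            for i in range(len(A) * p)]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
K = kron(A, B)   # a 4 x 4 matrix made of four scaled copies of B
```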


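Returning to the task at hand, one can show (Exercise 8.4) that the gradient is given by

$$\mathbf{g}(\mathbf{W}) = \nabla f(\mathbf{w}) = \sum_{i=1}^N (\boldsymbol{\mu}_i - \mathbf{y}_i) \otimes \mathbf{x}_i \tag{8.38}$$

where $\mathbf{y}_i = (\mathbb{I}(y_i = 1), \ldots, \mathbb{I}(y_i = C-1))$ and $\boldsymbol{\mu}_i(\mathbf{W}) = [p(y_i = 1|\mathbf{x}_i, \mathbf{W}), \ldots, p(y_i =
C-1|\mathbf{x}_i, \mathbf{W})]$ are column vectors of length $C - 1$. For example, if we have $D = 3$ feature
dimensions and $C = 3$ classes, this becomes

$$\mathbf{g}(\mathbf{W}) = \sum_i \begin{pmatrix} (\mu_{i1} - y_{i1})x_{i1} \\ (\mu_{i1} - y_{i1})x_{i2} \\ (\mu_{i1} - y_{i1})x_{i3} \\ (\mu_{i2} - y_{i2})x_{i1} \\ (\mu_{i2} - y_{i2})x_{i2} \\ (\mu_{i2} - y_{i2})x_{i3} \end{pmatrix} \tag{8.39}$$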

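In other words, for each class $c$, the derivative for the weights in the $c$’th column is

$$\nabla_{\mathbf{w}_c} f(\mathbf{W}) = \sum_i (\mu_{ic} - y_{ic})\mathbf{x}_i \tag{8.40}$$

This has the same form as in the binary logistic regression case, namely an error term times $\mathbf{x}_i$.
(This turns out to be a general property of distributions in the exponential family, as we will see
in Section 9.3.2.)

One can also show (Exercise 8.4) that the Hessian is the following block structured $D(C-1) \times D(C-1)$ matrix:

$$\mathbf{H}(\mathbf{W}) = \nabla^2 f(\mathbf{w}) = \sum_{i=1}^N (\mathrm{diag}(\boldsymbol{\mu}_i) - \boldsymbol{\mu}_i\boldsymbol{\mu}_i^T) \otimes (\mathbf{x}_i\mathbf{x}_i^T) \tag{8.41}$$

For example, if we have 3 features and 3 classes, this becomes

$$\begin{aligned}
\mathbf{H}(\mathbf{W}) &= \sum_i \begin{pmatrix} \mu_{i1} - \mu_{i1}^2 & -\mu_{i1}\mu_{i2} \\ -\mu_{i1}\mu_{i2} & \mu_{i2} - \mu_{i2}^2 \end{pmatrix} \otimes \begin{pmatrix} x_{i1}x_{i1} & x_{i1}x_{i2} & x_{i1}x_{i3} \\ x_{i2}x_{i1} & x_{i2}x_{i2} & x_{i2}x_{i3} \\ x_{i3}x_{i1} & x_{i3}x_{i2} & x_{i3}x_{i3} \end{pmatrix} && (8.42) \\
&= \sum_i \begin{pmatrix} (\mu_{i1} - \mu_{i1}^2)\mathbf{X}_i & -\mu_{i1}\mu_{i2}\mathbf{X}_i \\ -\mu_{i1}\mu_{i2}\mathbf{X}_i & (\mu_{i2} - \mu_{i2}^2)\mathbf{X}_i \end{pmatrix} && (8.43)
\end{aligned}$$

where $\mathbf{X}_i = \mathbf{x}_i\mathbf{x}_i^T$. In other words, the block $c, c'$ submatrix is given by

$$\mathbf{H}_{c,c'}(\mathbf{W}) = \sum_i \mu_{ic}(\delta_{c,c'} - \mu_{i,c'})\mathbf{x}_i\mathbf{x}_i^T \tag{8.44}$$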


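This is also a positive definite matrix, so there is a unique MLE.
Now consider minimizing

$$f'(\mathbf{W}) = -\log p(\mathcal{D}|\mathbf{w}) - \log p(\mathbf{W}) \tag{8.45}$$

where $p(\mathbf{W}) = \prod_c \mathcal{N}(\mathbf{w}_c|\mathbf{0}, \mathbf{V}_0)$. The new objective, its gradient and Hessian are given by

$$\begin{aligned}
f'(\mathbf{W}) &= f(\mathbf{W}) + \frac{1}{2}\sum_c \mathbf{w}_c^T\mathbf{V}_0^{-1}\mathbf{w}_c && (8.46) \\
\mathbf{g}'(\mathbf{W}) &= \mathbf{g}(\mathbf{W}) + \mathbf{V}_0^{-1}\Big(\sum_c \mathbf{w}_c\Big) && (8.47) \\
\mathbf{H}'(\mathbf{W}) &= \mathbf{H}(\mathbf{W}) + \mathbf{I}_C \otimes \mathbf{V}_0^{-1} && (8.48)
\end{aligned}$$

This can be passed to any gradient-based optimizer to find the MAP estimate. Note, however,
that the Hessian has size $O((CD) \times (CD))$, which is $C$ times more rows and columns than
in the binary case, so limited memory BFGS is more appropriate than Newton's method. See
logregFit for some Matlab code.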
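
The per-class gradient of Equation 8.40 is easy to check numerically. The sketch below is not the logregFit code; it is a small illustration with hypothetical function names and data, without any prior term, and it treats every column of $\mathbf{W}$ as a free parameter (the formula for a given column is the same whether or not the last column is clamped to zero).

```python
import math


def softmax(eta):
    m = max(eta)
    exps = [math.exp(e - m) for e in eta]
    z = sum(exps)
    return [e / z for e in exps]


def nll(W, xs, ys):
    """Negative log-likelihood; W[c] is the weight vector of class c."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = softmax([sum(a * b for a, b in zip(w_c, x)) for w_c in W])
        total -= math.log(mu[y])
    return total


def nll_grad(W, xs, ys):
    """Per-class gradients: grad[c] = sum_i (mu_ic - y_ic) x_i."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for x, y in zip(xs, ys):
        mu = softmax([sum(a * b for a, b in zip(w_c, x)) for w_c in W])
        for c in range(len(W)):
            err = mu[c] - (1.0 if c == y else 0.0)   # error term for class c
            for j, xj in enumerate(x):
                grad[c][j] += err * xj
    return grad


# Illustrative data: 3 classes (labels 0, 1, 2) and 2 features.
xs = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, 0.0]]
ys = [0, 1, 2, 1]
W = [[0.1, -0.3], [0.2, 0.4], [0.0, 0.0]]
G = nll_grad(W, xs, ys)
```

Since the class probabilities sum to one for every case, the per-class gradients sum to the zero vector, which gives a second quick consistency check.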


Bayesian logistic regression 


It is natural to want to compute the full posterior over the parameters, p(w|D), for logistic 
regression models. This can be useful for any situation where we want to associate confidence 
intervals with our predictions (e.g., this is necessary when solving contextual bandit problems, 
discussed in Section 5.7.3.1). 

Unfortunately, unlike the linear regression case, this cannot be done exactly, since there is no 
convenient conjugate prior for logistic regression. We discuss one simple approximation below; 
some other approaches include MCMC (Section 24.3.3.1), variational inference (Section 21.8.1), 
expectation propagation (Kuss and Rasmussen 2005), etc. For notational simplicity, we stick to 
binary logistic regression. 


Laplace approximation


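In this section, we discuss how to make a Gaussian approximation to a posterior distribution.
The approximation works as follows. Suppose $\boldsymbol{\theta} \in \mathbb{R}^D$. Let

$$p(\boldsymbol{\theta}|\mathcal{D}) = \frac{1}{Z}e^{-E(\boldsymbol{\theta})} \tag{8.49}$$

where $E(\boldsymbol{\theta})$ is called an energy function, and is equal to the negative log of the
unnormalized posterior, $E(\boldsymbol{\theta}) = -\log p(\boldsymbol{\theta}, \mathcal{D})$, with $Z = p(\mathcal{D})$ being the normalization constant.
Performing a Taylor series expansion around the mode $\boldsymbol{\theta}^*$ (i.e., the lowest energy state) we get

$$E(\boldsymbol{\theta}) \approx E(\boldsymbol{\theta}^*) + (\boldsymbol{\theta} - \boldsymbol{\theta}^*)^T\mathbf{g} + \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}^*)^T\mathbf{H}(\boldsymbol{\theta} - \boldsymbol{\theta}^*) \tag{8.50}$$

where $\mathbf{g}$ is the gradient and $\mathbf{H}$ is the Hessian of the energy function evaluated at the mode:

$$\mathbf{g} \triangleq \nabla E(\boldsymbol{\theta})\big|_{\boldsymbol{\theta}^*}, \quad \mathbf{H} \triangleq \frac{\partial^2 E(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}^T}\Big|_{\boldsymbol{\theta}^*} \tag{8.51}$$

Since $\boldsymbol{\theta}^*$ is the mode, the gradient term is zero. Hence

$$\begin{aligned}
\hat{p}(\boldsymbol{\theta}|\mathcal{D}) &\approx \frac{1}{Z}e^{-E(\boldsymbol{\theta}^*)}\exp\left[-\frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}^*)^T\mathbf{H}(\boldsymbol{\theta} - \boldsymbol{\theta}^*)\right] && (8.52) \\
&= \mathcal{N}(\boldsymbol{\theta}|\boldsymbol{\theta}^*, \mathbf{H}^{-1}) && (8.53) \\
Z = p(\mathcal{D}) &\approx \int \hat{p}(\boldsymbol{\theta}, \mathcal{D})\,d\boldsymbol{\theta} = e^{-E(\boldsymbol{\theta}^*)}(2\pi)^{D/2}|\mathbf{H}|^{-\frac{1}{2}} && (8.54)
\end{aligned}$$

The last line follows from the normalization constant of the multivariate Gaussian.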

Equation 8.54 is known as the Laplace approximation to the marginal likelihood. Therefore
Equation 8.52 is sometimes called the Laplace approximation to the posterior. However,
in the statistics community, the term “Laplace approximation” refers to a more sophisticated
method (see e.g. (Rue et al. 2009) for details). It may therefore be better to use the term
“Gaussian approximation” to refer to Equation 8.52. A Gaussian approximation is often a
reasonable approximation, since posteriors often become more “Gaussian-like” as the sample
size increases, for reasons analogous to the central limit theorem. (In physics, there is an
analogous technique known as a saddle point approximation.)
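
A one-dimensional worked example may help make Equation 8.54 concrete. The sketch below applies it to the integral $\int_0^\infty \theta^n e^{-\theta}d\theta = n!$, an example chosen here purely for illustration (it is not from the text); the energy, its mode and its second derivative are available in closed form, and the result is Stirling's formula.

```python
import math


def laplace_log_z(energy, hessian, mode, dim=1):
    """Log of the Laplace estimate of Z for a scalar parameter:
    -E(mode) + (D / 2) log(2 pi) - (1 / 2) log H(mode)."""
    return (-energy(mode) + 0.5 * dim * math.log(2.0 * math.pi)
            - 0.5 * math.log(hessian(mode)))


n = 20


def energy(t):
    """E(theta) = -log(theta^n * exp(-theta))."""
    return t - n * math.log(t)


def hessian(t):
    """Second derivative of the energy."""
    return n / t ** 2


mode = float(n)                      # solves E'(theta) = 1 - n / theta = 0
approx = math.exp(laplace_log_z(energy, hessian, mode))
exact = math.gamma(n + 1)            # the integral equals n!
```

For $n = 20$ the two values agree to within about half a percent, and the relative error shrinks as $n$ grows, in line with the remark that the approximation improves as the integrand becomes more Gaussian-like.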


Derivation of the BIC 


We can use the Gaussian approximation to write the log marginal likelihood as follows, dropping
irrelevant constants:

$$\log p(\mathcal{D}) \approx \log p(\mathcal{D}|\boldsymbol{\theta}^*) + \log p(\boldsymbol{\theta}^*) - \frac{1}{2}\log|\mathbf{H}| \tag{8.55}$$

The penalization terms which are added to the $\log p(\mathcal{D}|\boldsymbol{\theta}^*)$ are sometimes called the Occam
factor, and are a measure of model complexity. If we have a uniform prior, $p(\boldsymbol{\theta}) \propto 1$, we can
drop the second term, and replace $\boldsymbol{\theta}^*$ with the MLE, $\hat{\boldsymbol{\theta}}$.

We now focus on approximating the third term. We have $\mathbf{H} = \sum_{i=1}^N \mathbf{H}_i$, where $\mathbf{H}_i =
\nabla\nabla\log p(\mathcal{D}_i|\boldsymbol{\theta})$. Let us approximate each $\mathbf{H}_i$ by a fixed matrix $\hat{\mathbf{H}}$. Then we have

$$\log|\mathbf{H}| = \log|N\hat{\mathbf{H}}| = \log(N^D|\hat{\mathbf{H}}|) = D\log N + \log|\hat{\mathbf{H}}| \tag{8.56}$$

where $D = \dim(\boldsymbol{\theta})$ and we have assumed $\mathbf{H}$ is full rank. We can drop the $\log|\hat{\mathbf{H}}|$ term, since
it is independent of $N$, and thus will get overwhelmed by the likelihood. Putting all the pieces
together, we recover the BIC score (Section 5.3.2.4):

$$\log p(\mathcal{D}) \approx \log p(\mathcal{D}|\hat{\boldsymbol{\theta}}) - \frac{D}{2}\log N \tag{8.57}$$
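
In code, the BIC score is a one-liner. The function name and the two sets of numbers below are made up for illustration: the second hypothetical model fits slightly better but uses many more parameters, so it receives the lower score.

```python
import math


def bic_score(log_lik_at_mle, num_params, num_cases):
    """Approximate log marginal likelihood: log p(D | theta_hat) - (D / 2) log N."""
    return log_lik_at_mle - 0.5 * num_params * math.log(num_cases)


simple_model = bic_score(-120.0, num_params=2, num_cases=100)
complex_model = bic_score(-118.0, num_params=10, num_cases=100)
```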


8.4.3 Gaussian approximation for logistic regression 


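Now let us apply the Gaussian approximation to logistic regression. We will use a Gaussian
prior of the form $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{0}, \mathbf{V}_0)$, just as we did in MAP estimation. The approximate
posterior is given by

$$p(\mathbf{w}|\mathcal{D}) \approx \mathcal{N}(\mathbf{w}|\hat{\mathbf{w}}, \mathbf{H}^{-1}) \tag{8.58}$$

where $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} E(\mathbf{w})$, $E(\mathbf{w}) = -(\log p(\mathcal{D}|\mathbf{w}) + \log p(\mathbf{w}))$, and $\mathbf{H} = \nabla^2 E(\mathbf{w})|_{\hat{\mathbf{w}}}$.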

As an example, consider the linearly separable 2D data in Figure 8.5(a). There are many
parameter settings that correspond to lines that perfectly separate the training data; we show 4
examples. The likelihood surface is shown in Figure 8.5(b), where we see that the likelihood is
unbounded as we move up and to the right in parameter space, along a ridge where $w_2/w_1 =
2.35$ (this is indicated by the diagonal line). The reason for this is that we can maximize the
likelihood by driving $||\mathbf{w}||$ to infinity (subject to being on this line), since large regression weights
make the sigmoid function very steep, turning it into a step function. Consequently the MLE is
not well defined when the data is linearly separable.

To regularize the problem, let us use a vague spherical prior centered at the origin, $\mathcal{N}(\mathbf{w}|\mathbf{0}, 100\mathbf{I})$.
Multiplying this spherical prior by the likelihood surface results in a highly skewed posterior,
shown in Figure 8.5(c). (The posterior is skewed because the likelihood function “chops off”
regions of parameter space (in a “soft” fashion) which disagree with the data.) The MAP estimate
is shown by the blue dot. Unlike the MLE, this is not at infinity.

The Gaussian approximation to this posterior is shown in Figure 8.5(d). We see that this is 
a symmetric distribution, and therefore not a great approximation. Of course, it gets the mode 
correct (by construction), and it at least represents the fact that there is more uncertainty along 
the southwest-northeast direction (which corresponds to uncertainty about the orientation of 
separating lines) than perpendicular to this. Although a crude approximation, this is surely 
better than approximating the posterior by a delta function, which is what MAP estimation does. 
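
The following sketch shows the same idea in the simplest possible setting: a single scalar weight, no offset, and a prior $\mathcal{N}(w|0, 100)$. It is not the logregLaplaceGirolamiDemo code; the function name and the four data points are assumptions for illustration. The data are linearly separable, so the MLE is at infinity, but the prior keeps the MAP estimate finite.

```python
import math


def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))


def gaussian_posterior_1d(xs, ys, prior_var=100.0, num_iters=50):
    """Gaussian approximation N(w | w_map, 1 / H) for p(y=1|x) = sigm(w * x)
    with prior N(w | 0, prior_var); the mode is found by Newton's method."""
    def grad_and_hess(w):
        # Derivatives of the energy E(w) = -(log likelihood + log prior).
        g, h = w / prior_var, 1.0 / prior_var
        for x, y in zip(xs, ys):
            mu = sigm(w * x)
            g += (mu - y) * x
            h += mu * (1.0 - mu) * x * x
        return g, h

    w = 0.0
    for _ in range(num_iters):
        g, h = grad_and_hess(w)
        w -= g / h
    return w, 1.0 / grad_and_hess(w)[1]


xs = [-2.0, -1.0, 1.0, 2.0]     # separable: negative inputs have label 0
ys = [0, 0, 1, 1]
w_map, post_var = gaussian_posterior_1d(xs, ys)
```

The returned variance is the inverse of the energy's second derivative at the mode, which is the scalar version of $\mathbf{H}^{-1}$ in Equation 8.58.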


8.4.4 Approximating the posterior predictive 


Given the posterior, we can compute credible intervals, perform hypothesis tests, etc., just as we 
did in Section 7.6.3.3 in the case of linear regression. But in machine learning, interest usually 
focusses on prediction. The posterior predictive distribution has the form 


$$p(y|\mathbf{x}, \mathcal{D}) = \int p(y|\mathbf{x}, \mathbf{w})\,p(\mathbf{w}|\mathcal{D})\,d\mathbf{w} \tag{8.59}$$

Figure 8.5 (a) Two-class data in 2d. (b) Log-likelihood for a logistic regression model. The line is drawn
from the origin in the direction of the MLE (which is at infinity). The numbers correspond to 4 points
in parameter space, corresponding to the lines in (a). (c) Unnormalized log posterior (assuming vague
spherical prior). (d) Laplace approximation to posterior. Based on a figure by Mark Girolami. Figure
generated by logregLaplaceGirolamiDemo.


Unfortunately this integral is intractable. 
The simplest approximation is the plug-in approximation, which, in the binary case, takes the 
form 


$$p(y = 1|\mathbf{x}, \mathcal{D}) \approx p(y = 1|\mathbf{x}, \mathbb{E}[\mathbf{w}]) \tag{8.60}$$

where $\mathbb{E}[\mathbf{w}]$ is the posterior mean. In this context, $\mathbb{E}[\mathbf{w}]$ is called the Bayes point. Of course,
such a plug-in estimate underestimates the uncertainty. We discuss some better approximations
below.

Figure 8.6 Posterior predictive distribution for a logistic regression model in 2d. Top left: contours of
$p(y = 1|\mathbf{x}, \hat{\mathbf{w}}_{map})$. Top right: samples from the posterior predictive distribution. Bottom left: Averaging
over these samples. Bottom right: moderated output (probit approximation). Based on a figure by Mark
Girolami. Figure generated by logregLaplaceGirolamiDemo.


Monte Carlo approximation 


A better approach is to use a Monte Carlo approximation, as follows:

$$p(y = 1|\mathbf{x}, \mathcal{D}) \approx \frac{1}{S}\sum_{s=1}^S \mathrm{sigm}((\mathbf{w}^s)^T\mathbf{x}) \tag{8.61}$$

where $\mathbf{w}^s \sim p(\mathbf{w}|\mathcal{D})$ are samples from the posterior. (This technique can be trivially extended
to the multi-class case.) If we have approximated the posterior using Monte Carlo, we can reuse
these samples for prediction. If we made a Gaussian approximation to the posterior, we can
draw independent samples from the Gaussian using standard methods.
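
A minimal sketch of Equation 8.61 for a scalar weight with a Gaussian approximate posterior is given below. The function name, the seed, the sample count and the posterior parameters are all illustrative assumptions.

```python
import math
import random


def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))


def mc_predictive(x, post_mean, post_var, num_samples=20000, seed=0):
    """Monte Carlo estimate of p(y=1|x, D) for a scalar weight whose
    posterior is N(w | post_mean, post_var)."""
    rng = random.Random(seed)
    sd = math.sqrt(post_var)
    draws = (rng.gauss(post_mean, sd) for _ in range(num_samples))
    return sum(sigm(w * x) for w in draws) / num_samples


estimate = mc_predictive(x=1.0, post_mean=2.0, post_var=4.0)
plug_in = sigm(2.0 * 1.0)     # prediction using the posterior mean only
```

Because the samples spread over both steep and shallow sigmoids, the averaged prediction is pulled toward 0.5 relative to the plug-in value.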

Figure 8.6(b) shows samples from the posterior predictive for our 2d example. Figure 8.6(c)
shows the average of these samples. By averaging over multiple predictions, we see that the
uncertainty in the decision boundary “splays out” as we move further from the training data.
So although the decision boundary is linear, the posterior predictive density is not linear. Note
also that the posterior mean decision boundary is roughly equally far from both classes; this is
the Bayesian analog of the large margin principle discussed in Section 14.5.2.2.

Figure 8.7 (a) Posterior predictive density for SAT data. The red circle denotes the posterior mean, the
blue cross the posterior median, and the blue lines denote the 5th and 95th percentiles of the predictive
distribution. Figure generated by logregSATdemoBayes. (b) The logistic (sigmoid) function $\mathrm{sigm}(x)$ in
solid red, with the rescaled probit function $\Phi(\lambda x)$ in dotted blue superimposed. Here $\lambda = \sqrt{\pi/8}$, which
was chosen so that the derivatives of the two curves match at $x = 0$. Based on Figure 4.9 of (Bishop
2006b). Figure generated by probitPlot. Figure generated by probitRegDemo.

Figure 8.7(a) shows an example in 1d. The red dots denote the mean of the posterior predictive
evaluated at the training data. The vertical blue lines denote 95% credible intervals for the 
posterior predictive; the small blue star is the median. We see that, with the Bayesian approach, 
we are able to model our uncertainty about the probability a student will pass the exam based 
on his SAT score, rather than just getting a point estimate. 


Probit approximation (moderated output) * 


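If we have a Gaussian approximation to the posterior $p(\mathbf{w}|\mathcal{D}) \approx \mathcal{N}(\mathbf{w}|\mathbf{m}_N, \mathbf{V}_N)$, we can also
compute a deterministic approximation to the posterior predictive distribution, at least in the
binary case. We proceed as follows:

$$\begin{aligned}
p(y = 1|\mathbf{x}, \mathcal{D}) &\approx \int \mathrm{sigm}(\mathbf{w}^T\mathbf{x})\,p(\mathbf{w}|\mathcal{D})\,d\mathbf{w} = \int \mathrm{sigm}(a)\,\mathcal{N}(a|\mu_a, \sigma_a^2)\,da && (8.62) \\
a &\triangleq \mathbf{w}^T\mathbf{x} && (8.63) \\
\mu_a &\triangleq \mathbb{E}[a] = \mathbf{m}_N^T\mathbf{x} && (8.64) \\
\sigma_a^2 &\triangleq \mathrm{var}[a] = \int p(a|\mathcal{D})\left[a^2 - \mathbb{E}[a]^2\right]da && (8.65) \\
&= \int p(\mathbf{w}|\mathcal{D})\left[(\mathbf{w}^T\mathbf{x})^2 - (\mathbf{m}_N^T\mathbf{x})^2\right]d\mathbf{w} = \mathbf{x}^T\mathbf{V}_N\mathbf{x} && (8.66)
\end{aligned}$$

Thus we see that we need to evaluate the expectation of a sigmoid with respect to a Gaussian.
This can be approximated by exploiting the fact that the sigmoid function is similar to the
probit function, which is given by the cdf of the standard normal:

$$\Phi(a) \triangleq \int_{-\infty}^a \mathcal{N}(x|0, 1)\,dx \tag{8.67}$$

Figure 8.7(b) plots the sigmoid and probit functions. We have rescaled the axes so that $\mathrm{sigm}(a)$
has the same slope as $\Phi(\lambda a)$ at the origin, where $\lambda^2 = \pi/8$.
The advantage of using the probit is that one can convolve it with a Gaussian analytically:

$$\int \Phi(\lambda a)\,\mathcal{N}(a|\mu, \sigma^2)\,da = \Phi\left(\frac{\mu}{(\lambda^{-2} + \sigma^2)^{\frac{1}{2}}}\right) \tag{8.68}$$

We now plug in the approximation $\mathrm{sigm}(a) \approx \Phi(\lambda a)$ to both sides of this equation to get

$$\begin{aligned}
\int \mathrm{sigm}(a)\,\mathcal{N}(a|\mu, \sigma^2)\,da &\approx \mathrm{sigm}(\kappa(\sigma^2)\mu) && (8.69) \\
\kappa(\sigma^2) &\triangleq (1 + \pi\sigma^2/8)^{-\frac{1}{2}} && (8.70)
\end{aligned}$$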


Applying this to the logistic regression model we get the following expression (first suggested in 
(Spiegelhalter and Lauritzen 1990)): 


$$p(y = 1|\mathbf{x}, \mathcal{D}) \approx \mathrm{sigm}(\kappa(\sigma_a^2)\mu_a) \tag{8.71}$$


Figure 8.6(d) indicates that this gives very similar results to the Monte Carlo approximation.
Using Equation 8.71 is sometimes called a moderated output, since it is less extreme than
the plug-in estimate. To see this, note that $0 < \kappa(\sigma^2) < 1$ and hence, for $\mu \geq 0$,

$$\mathrm{sigm}(\kappa(\sigma^2)\mu) \leq \mathrm{sigm}(\mu) = p(y = 1|\mathbf{x}, \hat{\mathbf{w}}) \tag{8.72}$$

where the inequality is strict if $\mu \neq 0$ (for $\mu < 0$ the inequality is reversed). If $\mu > 0$, we have
$p(y = 1|\mathbf{x}, \hat{\mathbf{w}}) > 0.5$, but the
moderated prediction is always closer to 0.5, so it is less confident. However, the decision
boundary occurs whenever $p(y = 1|\mathbf{x}, \mathcal{D}) = \mathrm{sigm}(\kappa(\sigma^2)\mu) = 0.5$, which implies $\mu = \hat{\mathbf{w}}^T\mathbf{x} =
0$. Hence the decision boundary for the moderated approximation is the same as for the plug-in
approximation. So the number of misclassifications will be the same for the two methods,
but the log-likelihood will not. (Note that in the multiclass case, taking into account posterior
covariance gives different answers than the plug-in approach: see Exercise 3.10.3 of (Rasmussen
and Williams 2006).)
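
The moderated output of Equation 8.71 takes two lines to compute. To give a feel for its accuracy, the sketch below also evaluates the sigmoid-Gaussian integral with a simple trapezoid rule; the quadrature routine, its grid settings, and the example values of $\mu_a$ and $\sigma_a^2$ are assumptions made for illustration.

```python
import math


def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))


def moderated_output(mu, var):
    """sigm(kappa * mu) with kappa = (1 + pi * var / 8)^(-1/2)."""
    kappa = 1.0 / math.sqrt(1.0 + math.pi * var / 8.0)
    return sigm(kappa * mu)


def quadrature_reference(mu, var, num_points=4001, width=10.0):
    """Trapezoid-rule estimate of the integral of sigm(a) N(a | mu, var) da,
    taken over mu plus or minus `width` standard deviations."""
    sd = math.sqrt(var)
    lo = mu - width * sd
    h = 2.0 * width * sd / (num_points - 1)
    total = 0.0
    for i in range(num_points):
        a = lo + i * h
        dens = math.exp(-0.5 * ((a - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, num_points - 1) else 1.0   # end points count half
        total += weight * sigm(a) * dens * h
    return total


approx = moderated_output(2.0, 4.0)
reference = quadrature_reference(2.0, 4.0)
```

For these illustrative values both numbers lie between 0.5 and the plug-in prediction sigm(2), and they agree closely.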


Residual analysis (outlier detection) * 


It is sometimes useful to detect data cases which are “outliers”. This is called residual analysis
or case analysis. In a regression setting, this can be performed by computing $r_i = y_i - \hat{y}_i$, where
$\hat{y}_i = \hat{\mathbf{w}}^T\mathbf{x}_i$. These values should follow a $\mathcal{N}(0, \sigma^2)$ distribution, if the modelling assumptions
are correct. This can be assessed by creating a qq-plot, where we plot the $N$ theoretical
quantiles of a Gaussian distribution against the $N$ empirical quantiles of the $r_i$. Points that
deviate from the straight line are potential outliers.

Classical methods, based on residuals, do not work well for binary data, because they rely
on asymptotic normality of the test statistics. However, adopting a Bayesian approach, we
can just define outliers to be points for which $p(y_i|\hat{y}_i)$ is small, where we typically use
$\hat{y}_i = \mathrm{sigm}(\hat{\mathbf{w}}^T\mathbf{x}_i)$. Note that $\hat{\mathbf{w}}$ was estimated from all the data. A better method is to exclude
$(\mathbf{x}_i, y_i)$ from the estimate of $\mathbf{w}$ when predicting $y_i$. That is, we define outliers to be points
which have low probability under the cross-validated posterior predictive distribution, defined
by

$$p(y_i|\mathbf{x}_i, \mathbf{x}_{-i}, \mathbf{y}_{-i}) = \int p(y_i|\mathbf{x}_i, \mathbf{w})\prod_{i' \neq i} p(y_{i'}|\mathbf{x}_{i'}, \mathbf{w})\,p(\mathbf{w})\,d\mathbf{w} \tag{8.73}$$

This can be efficiently approximated by sampling methods (Gelfand 1996). For further discussion
of residual analysis in logistic regression models, see e.g., (Johnson and Albert 1999, Sec 3.4).


Online learning and stochastic optimization 


Traditionally machine learning is performed offline, which means we have a batch of data, and
we optimize an objective of the following form

$$f(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^N f(\boldsymbol{\theta}, \mathbf{z}_i) \tag{8.74}$$

where $\mathbf{z}_i = (\mathbf{x}_i, y_i)$ in the supervised case, or just $\mathbf{x}_i$ in the unsupervised case, and $f(\boldsymbol{\theta}, \mathbf{z}_i)$ is
some kind of loss function. For example, we might use

$$f(\boldsymbol{\theta}, \mathbf{z}_i) = -\log p(y_i|\mathbf{x}_i, \boldsymbol{\theta}) \tag{8.75}$$

in which case we are trying to maximize the likelihood. Alternatively, we might use

$$f(\boldsymbol{\theta}, \mathbf{z}_i) = L(y_i, h(\mathbf{x}_i, \boldsymbol{\theta})) \tag{8.76}$$

where $h(\mathbf{x}_i, \boldsymbol{\theta})$ is a prediction function, and $L(y, \hat{y})$ is some other loss function such as squared
error or the Huber loss. In frequentist decision theory, the average loss is called the risk (see
Section 6.3), so this overall approach is called empirical risk minimization or ERM (see Section 6.5
for details).

However, if we have streaming data, we need to perform online learning, so we can update
our estimates as each new data point arrives rather than waiting until “the end” (which may
never occur). And even if we have a batch of data, we might want to treat it like a stream if it is
too large to hold in main memory. Below we discuss learning methods for this kind of scenario.¹


1. A simple implementation trick can be used to speed up batch learning algorithms when applied to data sets that
are too large to hold in memory. First note that the naive implementation makes a pass over the data file, from the
beginning to end, accumulating the sufficient statistics and gradients as it goes; then an update is performed and the
process repeats. Unfortunately, at the end of each pass, the data from the beginning of the file will have been evicted
from the cache (since we are assuming it cannot all fit into memory). Rather than going back to the beginning of the
file and reloading it, we can simply work backwards from the end of the file, which is already in memory. We then
repeat this forwards-backwards pattern over the data. This simple trick is known as rocking.

Online learning and regret minimization


Suppose that at each step, “nature” presents a sample $\mathbf{z}_k$ and the “learner” must respond with
a parameter estimate $\boldsymbol{\theta}_k$. In the theoretical machine learning community, the objective used in
online learning is the regret, which is the averaged loss incurred relative to the best we could
have gotten in hindsight using a single fixed parameter value:

$$\mathrm{regret}_k \triangleq \frac{1}{k}\sum_{t=1}^k f(\boldsymbol{\theta}_t, \mathbf{z}_t) - \min_{\boldsymbol{\theta}^* \in \Theta}\frac{1}{k}\sum_{t=1}^k f(\boldsymbol{\theta}^*, \mathbf{z}_t) \tag{8.77}$$

For example, imagine we are investing in the stock-market. Let $\theta_j$ be the amount we invest in
stock $j$, and let $z_j$ be the return on this stock. Our loss function is $f(\boldsymbol{\theta}, \mathbf{z}) = -\boldsymbol{\theta}^T\mathbf{z}$. The regret
is how much better (or worse) we did by trading at each step, rather than adopting a “buy and
hold” strategy using an oracle to choose which stocks to buy.

One simple algorithm for online learning is online gradient descent (Zinkevich 2003), which
is as follows: at each step $k$, update the parameters using

$$\boldsymbol{\theta}_{k+1} = \mathrm{proj}_{\Theta}(\boldsymbol{\theta}_k - \eta_k\mathbf{g}_k) \tag{8.78}$$

where $\mathrm{proj}_{\mathcal{V}}(\mathbf{v}) = \arg\min_{\mathbf{w} \in \mathcal{V}} ||\mathbf{w} - \mathbf{v}||_2$ is the projection of vector $\mathbf{v}$ onto space $\mathcal{V}$, $\mathbf{g}_k =
\nabla f(\boldsymbol{\theta}_k, \mathbf{z}_k)$ is the gradient, and $\eta_k$ is the step size. (The projection step is only needed if
the parameter must be constrained to live in a certain subset of $\mathbb{R}^D$. See Section 13.4.3 for
details.) Below we will see how this approach to regret minimization relates to more traditional
objectives, such as MLE.
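
To illustrate Equations 8.77 and 8.78 together, the sketch below runs projected online gradient descent on a scalar parameter with the squared loss $f(\theta, z) = (\theta - z)^2$ and then measures the average regret. The loss, the interval constraint, the step size $\eta_k = 1/(2k)$ and the data stream are all assumptions chosen to keep the example small.

```python
def project(v, lo, hi):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return min(max(v, lo), hi)


def online_gradient_descent(zs, lo=-1.0, hi=1.0, theta=0.0):
    """Projected online gradient descent for f(theta, z) = (theta - z)^2.

    Returns the parameter played at each step, chosen before z_k is seen.
    """
    played = []
    for k, z in enumerate(zs, start=1):
        played.append(theta)
        grad = 2.0 * (theta - z)
        theta = project(theta - grad / (2.0 * k), lo, hi)   # eta_k = 1 / (2k)
    return played


def average_regret(played, zs, lo=-1.0, hi=1.0):
    """Average loss of the learner minus that of the best fixed parameter."""
    n = len(zs)
    best = project(sum(zs) / n, lo, hi)   # minimizer of the summed squared loss
    ours = sum((t - z) ** 2 for t, z in zip(played, zs)) / n
    return ours - sum((best - z) ** 2 for z in zs) / n


zs = [0.2, 0.6] * 500                     # an illustrative stream of 1000 samples
regret = average_regret(online_gradient_descent(zs), zs)
```

With this particular step size the learner's parameter is simply the running mean of the samples seen so far, and the average regret shrinks toward zero as the stream gets longer.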


There are a variety of other approaches to regret minimization which are beyond the scope 
of this book (see e.g., Cesa-Bianchi and Lugosi (2006) for details). 


Stochastic optimization and risk minimization 


Now suppose that instead of minimizing regret with respect to the past, we want to minimize 
expected loss in the future, as is more common in (frequentist) statistical learning theory. That 
is, we want to minimize 


$$f(\boldsymbol{\theta}) = \mathbb{E}[f(\boldsymbol{\theta}, \mathbf{z})] \tag{8.79}$$


where the expectation is taken over future data. Optimizing functions where some of the
variables in the objective are random is called stochastic optimization.²

Suppose we receive an infinite stream of samples from the distribution. One way to optimize 
stochastic objectives such as Equation 8.79 is to perform the update in Equation 8.78 at each 
step. This is called stochastic gradient descent or SGD (Nemirovski and Yudin 1978). Since we 
typically want a single parameter estimate, we can use a running average: 


$$\bar{\boldsymbol{\theta}}_k = \frac{1}{k}\sum_{t=1}^k \boldsymbol{\theta}_t \tag{8.80}$$


2. Note that in stochastic optimization, the objective is stochastic, and therefore the algorithms will be, too. However,
it is also possible to apply stochastic optimization algorithms to deterministic objectives. Examples include simulated
annealing (Section 24.6.1) and stochastic gradient descent applied to the empirical risk minimization problem. There are
some interesting theoretical connections between online learning and stochastic optimization (Cesa-Bianchi and Lugosi
2006), but this is beyond the scope of this book.

This is called Polyak-Ruppert averaging, and can be implemented recursively as follows:

$$\bar{\boldsymbol{\theta}}_k = \bar{\boldsymbol{\theta}}_{k-1} - \frac{1}{k}(\bar{\boldsymbol{\theta}}_{k-1} - \boldsymbol{\theta}_k) \tag{8.81}$$


See e.g., (Spall 2003; Kushner and Yin 2003) for details. 
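
The recursive form of the running average needs only the previous average and the new iterate, as the following sketch for scalar iterates shows (the function name and the numbers are illustrative):

```python
def polyak_average(thetas):
    """Running average of scalar iterates, updated one iterate at a time."""
    avg = 0.0
    for k, theta in enumerate(thetas, start=1):
        avg = avg - (avg - theta) / k    # recursive form of the mean
    return avg


averaged = polyak_average([1.0, 2.0, 6.0])   # same as (1 + 2 + 6) / 3
```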


Setting the step size 


We now discuss some sufficient conditions on the learning rate to guarantee convergence of
SGD. These are known as the Robbins-Monro conditions:

$$\sum_{k=1}^{\infty}\eta_k = \infty, \quad \sum_{k=1}^{\infty}\eta_k^2 < \infty \tag{8.82}$$

The set of values of $\eta_k$ over time is called the learning rate schedule. Various formulas are
used, such as $\eta_k = 1/k$, or the following (Bottou 1998; Bach and Moulines 2011):

$$\eta_k = (\tau_0 + k)^{-\kappa} \tag{8.83}$$

where $\tau_0 \geq 0$ slows down early iterations of the algorithm, and $\kappa \in (0.5, 1]$ controls the rate at
which old values are forgotten.

The need to adjust these tuning parameters is one of the main drawbacks of stochastic
optimization. One simple heuristic (Bottou 2007) is as follows: store an initial subset of the
data, and try a range of $\eta$ values on this subset; then choose the one that results in the fastest
decrease in the objective and apply it to all the rest of the data. Note that this may not result
in convergence, but the algorithm can be terminated when the performance improvement on a
hold-out set plateaus (this is called early stopping).
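
The schedule of Equation 8.83 is simple to write down in code. The function name and the parameter values used below are illustrative only.

```python
def step_size(k, tau0=0.0, kappa=1.0):
    """Learning rate (tau0 + k)^(-kappa) for iteration k = 1, 2, ..."""
    return (tau0 + k) ** (-kappa)


# With tau0 = 0 and kappa = 1 this reduces to the 1 / k schedule.
first_steps = [step_size(k) for k in range(1, 5)]
# A larger tau0 makes the early steps smaller and the decay more gradual.
damped_steps = [step_size(k, tau0=10.0, kappa=0.6) for k in range(1, 5)]
```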


Per-parameter step sizes 


One drawback of SGD is that it uses the same step size for all parameters. We now briefly
present a method known as adagrad (short for adaptive gradient) (Duchi et al. 2010), which is
similar in spirit to a diagonal Hessian approximation. (See also (Schaul et al. 2012) for a similar
approach.) In particular, if $\theta_i(k)$ is parameter $i$ at time $k$, and $g_i(k)$ is its gradient, then we
make an update as follows:

$$\theta_i(k+1) = \theta_i(k) - \eta\frac{g_i(k)}{\tau_0 + \sqrt{s_i(k)}} \tag{8.84}$$

where the diagonal step size vector is the gradient vector squared, summed over all time steps.
This can be recursively updated as follows:

$$s_i(k) = s_i(k-1) + g_i(k)^2 \tag{8.85}$$

The result is a per-parameter step size that adapts to the curvature of the loss function. This
method was originally derived for the regret minimization case, but it can be applied more
generally.
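
A compact sketch of this per-parameter update is shown below, applied to a quadratic whose two coordinates have very different curvatures. The function name, the test objective, and the values of $\eta$ and $\tau_0$ are assumptions chosen for illustration.

```python
import math


def adagrad(grad, theta, eta=0.5, tau0=1e-8, num_steps=200):
    """Adagrad: each parameter's step is scaled by the square root of its
    own running sum of squared gradients."""
    theta = list(theta)
    s = [0.0] * len(theta)                 # running sums of squared gradients
    for _ in range(num_steps):
        g = grad(theta)
        for i, gi in enumerate(g):
            s[i] += gi * gi
            theta[i] -= eta * gi / (tau0 + math.sqrt(s[i]))
    return theta


def badly_scaled_grad(th):
    """Gradient of f(theta) = 0.5 * (100 * theta_0^2 + theta_1^2)."""
    return [100.0 * th[0], th[1]]


result = adagrad(badly_scaled_grad, [1.0, 1.0])
```

Because each coordinate's gradient is divided by a quantity that scales with it, the steep and the shallow coordinate shrink at essentially the same rate here, whereas plain gradient descent with a single step size would be limited by the steep one.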
SGD compared to batch learning


If we don’t have an infinite data stream, we can “simulate” one by sampling data points at 
random from our training set. Essentially we are optimizing Equation 8.74 by treating it as an 
expectation with respect to the empirical distribution. 


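Algorithm 8.3: Stochastic gradient descent

    Initialize $\boldsymbol{\theta}$, $\eta$;
    repeat
        Randomly permute data;
        for $i = 1 : N$ do
            $\mathbf{g} = \nabla f(\boldsymbol{\theta}, \mathbf{z}_i)$;
            $\boldsymbol{\theta} \leftarrow \mathrm{proj}_{\Theta}(\boldsymbol{\theta} - \eta\mathbf{g})$;
            Update $\eta$;
    until converged;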


In theory, we should sample with replacement, although in practice it is usually better to 
randomly permute the data and sample without replacement, and then to repeat. A single such 
pass over the entire data set is called an epoch. See Algorithm 8.3 for some pseudocode.

In this offline case, it is often better to compute the gradient of a mini-batch of B data cases. 
If B = 1, this is standard SGD, and if B = N, this is standard steepest descent. Typically 
B ~ 100 is used. 
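
The epoch and mini-batch structure described above can be sketched as follows. To keep the example self-contained it minimizes the toy loss $f(\theta, z) = \frac{1}{2}(\theta - z)^2$, whose empirical risk minimizer is the sample mean; the function name, the decaying step size, the batch size and the data are all illustrative assumptions, and no projection step is needed because the problem is unconstrained.

```python
import random


def minibatch_sgd(data, theta=0.0, batch_size=10, num_epochs=50, seed=0):
    """Mini-batch SGD for f(theta, z) = 0.5 * (theta - z)^2."""
    rng = random.Random(seed)
    data = list(data)
    k = 0
    for _ in range(num_epochs):
        rng.shuffle(data)                        # one random permutation per epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad = sum(theta - z for z in batch) / len(batch)
            k += 1
            theta -= grad * (1.0 + k) ** -0.7    # decaying step size
    return theta


data = [1.0, 2.0, 3.0, 4.0, 5.0] * 20            # N = 100 cases with mean 3
theta_hat = minibatch_sgd(data)
```

Setting `batch_size=1` gives standard SGD and `batch_size=len(data)` gives steepest descent on the full objective.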

Although a simple first-order method, SGD performs surprisingly well on some problems, 
especially ones with large data sets (Bottou 2007). The intuitive reason for this is that one can 
get a fairly good estimate of the gradient by looking at just a few examples. Carefully evaluating 
precise gradients using large datasets is often a waste of time, since the algorithm will have 
to recompute the gradient again anyway at the next step. It is often a better use of computer 
time to have a noisy estimate and to move rapidly through parameter space. As an extreme 
example, suppose we double the training set by duplicating every example. Batch methods will 
take twice as long, but online methods will be unaffected, since the direction of the gradient 
has not changed (doubling the size of the data changes the magnitude of the gradient, but that 
is irrelevant, since the gradient is being scaled by the step size anyway). 

In addition to enhanced speed, SGD is often less prone to getting stuck in shallow local 
minima, because it adds a certain amount of “noise”. Consequently it is quite popular in the 
machine learning community for fitting models with non-convex objectives, such as neural 
networks (Section 16.5) and deep belief networks (Section 28.1). 


The LMS algorithm 


As an example of SGD, let us consider how to compute the MLE for linear regression in an
online fashion. We derived the batch gradient in Equation 7.14. The online gradient at iteration
$k$ is given by

$$\mathbf{g}_k = \mathbf{x}_i(\boldsymbol{\theta}_k^T\mathbf{x}_i - y_i) \tag{8.86}$$

where $i = i(k)$ is the training example to use at iteration $k$. If the data set is streaming, we use
$i(k) = k$; we shall assume this from now on, for notational simplicity. Equation 8.86 is easy
to interpret: it is the feature vector $\mathbf{x}_k$ weighted by the difference between what we predicted,
$\hat{y}_k = \boldsymbol{\theta}_k^T\mathbf{x}_k$, and the true response, $y_k$; hence the gradient acts like an error signal.

Figure 8.8 Illustration of the LMS algorithm. Left: we start from $\boldsymbol{\theta} = (-0.5, 2)$ and slowly converge
to the least squares solution of $\hat{\boldsymbol{\theta}} = (1.45, 0.92)$ (red cross). Right: plot of objective function over time.
Note that it does not decrease monotonically. Figure generated by LMSdemo.

After computing the gradient, we take a step along it as follows:

$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \eta_k(\hat{y}_k - y_k)\mathbf{x}_k \tag{8.87}$$
(There is no need for a projection step, since this is an unconstrained optimization problem.) 
This algorithm is called the least mean squares or LMS algorithm, and is also known as the 
delta rule, or the Widrow-Hoff rule. 

Figure 8.8 shows the results of applying this algorithm to the data shown in Figure 7.2. We
start at $\boldsymbol{\theta} = (-0.5, 2)$ and converge (in the sense that $||\boldsymbol{\theta}_k - \boldsymbol{\theta}_{k-1}||_2^2$ drops below a threshold
of $10^{-2}$) in about 26 iterations.

Note that LMS may require multiple passes through the data to find the optimum. By 
contrast, the recursive least squares algorithm, which is based on the Kalman filter and which 
uses second-order information, finds the optimum in a single pass (see Section 18.2.3). See also 
Exercise 7.7. 
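
The LMS update in Equation 8.87 can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the LMSdemo code: the function name `lms`, the constant step size `eta`, the random visiting order, and the noise-free synthetic data are all assumptions made for this example.

```python
import random

def lms(xs, ys, theta0, eta=0.1, n_passes=50, seed=0):
    """Least mean squares: theta <- theta - eta * (yhat - y) * x (Equation 8.87)."""
    rng = random.Random(seed)
    theta = list(theta0)
    order = list(range(len(xs)))
    for _ in range(n_passes):
        rng.shuffle(order)  # visit the examples in a random order on each pass
        for i in order:
            x, y = xs[i], ys[i]
            yhat = sum(t * xj for t, xj in zip(theta, x))  # prediction theta^T x
            err = yhat - y                                 # error signal
            theta = [t - eta * err * xj for t, xj in zip(theta, x)]
    return theta

# Synthetic, noise-free data (an assumption for illustration): y = 1.5 + 0.9 * x.
xs = [(1.0, k / 10.0) for k in range(-10, 11)]  # leading 1.0 is the bias feature
ys = [1.5 + 0.9 * x[1] for x in xs]
theta = lms(xs, ys, theta0=(-0.5, 2.0))
```

Because the toy data is noise free, a small constant step size is enough for the iterates to settle on the generating parameters; with noisy data one would typically decay the step size instead.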


The perceptron algorithm 


Now let us consider how to fit a binary logistic regression model in an online manner. The 
batch gradient was given in Equation 8.5. In the online case, the weight update has the simple 
form 


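$$\boldsymbol{\theta}_k = \boldsymbol{\theta}_{k-1} - \eta_k \mathbf{g}_i = \boldsymbol{\theta}_{k-1} - \eta_k (\mu_i - y_i) \mathbf{x}_i \tag{8.88}$$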


where $\mu_i = p(y_i = 1 \mid \mathbf{x}_i, \boldsymbol{\theta}_k) = \mathbb{E}[y_i \mid \mathbf{x}_i, \boldsymbol{\theta}_k]$. We see that this has exactly the same form as the
LMS algorithm. Indeed, this property holds for all generalized linear models (Section 9.3).

We now consider an approximation to this algorithm. Specifically, let

$$\hat{y}_i = \arg\max_{y \in \{0,1\}} p(y \mid \mathbf{x}_i, \boldsymbol{\theta}) \tag{8.89}$$

represent the most probable class label. We replace $\mu_i = p(y = 1 \mid \mathbf{x}_i, \boldsymbol{\theta}) = \mathrm{sigm}(\boldsymbol{\theta}^T \mathbf{x}_i)$ in the
gradient expression with $\hat{y}_i$. Thus the approximate gradient becomes

$$\mathbf{g}_i \approx (\hat{y}_i - y_i) \mathbf{x}_i \tag{8.90}$$


It will make the algebra simpler if we assume $y \in \{-1, +1\}$ rather than $y \in \{0, 1\}$. In this
case, our prediction becomes

$$\hat{y}_i = \mathrm{sign}(\boldsymbol{\theta}^T \mathbf{x}_i) \tag{8.91}$$

Then if $\hat{y}_i y_i = -1$, we have made an error, but if $\hat{y}_i y_i = +1$, we guessed the right label.

At each step, we update the weight vector by adding on the negative gradient. The key observation is
that, if we predicted correctly, then $\hat{y}_i = y_i$, so the (approximate) gradient is zero and we do
not change the weight vector. But if $\mathbf{x}_i$ is misclassified, we update the weights as follows: If
$\hat{y}_i = 1$ but $y_i = -1$, then the negative gradient is $-(\hat{y}_i - y_i)\mathbf{x}_i = -2\mathbf{x}_i$; and if $\hat{y}_i = -1$ but
$y_i = 1$, then the negative gradient is $-(\hat{y}_i - y_i)\mathbf{x}_i = 2\mathbf{x}_i$. We can absorb the factor of 2 into
the learning rate $\eta$ and just write the update, in the case of a misclassification, as

$$\boldsymbol{\theta}_k = \boldsymbol{\theta}_{k-1} + \eta_k y_i \mathbf{x}_i \tag{8.92}$$


Since it is only the sign of the weights that matters, not the magnitude, we will set $\eta_k = 1$. See
Algorithm 8.4 for the pseudocode.

One can show that this method, known as the perceptron algorithm (Rosenblatt 1958), will
converge, provided the data is linearly separable, i.e., that there exist parameters $\boldsymbol{\theta}$ such that
predicting with $\mathrm{sign}(\boldsymbol{\theta}^T \mathbf{x})$ achieves 0 error on the training set. However, if the data is not
linearly separable, the algorithm will not converge, and even if it does converge, it may take
a long time. There are much better ways to train logistic regression models (such as using
proper SGD, without the gradient approximation, or IRLS, discussed in Section 8.3.4). However,
the perceptron algorithm is historically important: it was one of the first machine learning
algorithms ever derived (by Frank Rosenblatt in 1957), and was even implemented in analog
hardware. In addition, the algorithm can be used to fit models where computing marginals
$p(y_i \mid \mathbf{x}, \boldsymbol{\theta})$ is more expensive than computing the MAP output, $\arg\max_y p(y \mid \mathbf{x}, \boldsymbol{\theta})$; this arises
in some structured-output classification problems. See Section 19.7 for details.
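
The perceptron update of Equation 8.92 (with the learning rate set to 1) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `perceptron`, the `max_passes` cap, the convention that a score of exactly zero is predicted as -1, and the toy data are all choices made for this example rather than anything prescribed by the text.

```python
def perceptron(xs, ys, max_passes=100):
    """Perceptron updates (Equation 8.92 with eta = 1); labels must be -1 or +1."""
    theta = [0.0] * len(xs[0])
    for _ in range(max_passes):
        n_mistakes = 0
        for x, y in zip(xs, ys):
            score = sum(t * xj for t, xj in zip(theta, x))
            yhat = 1 if score > 0 else -1   # sign(theta^T x); a tie counts as -1
            if yhat != y:                   # misclassified: theta <- theta + y * x
                theta = [t + y * xj for t, xj in zip(theta, x)]
                n_mistakes += 1
        if n_mistakes == 0:                 # a full pass with no errors: converged
            return theta, True
    return theta, False                     # gave up; data may not be separable

# Toy linearly separable data (hypothetical); the first feature is a constant bias term.
xs = [(1.0, 2.0, 1.0), (1.0, 1.5, 2.0), (1.0, -1.0, -1.5), (1.0, -2.0, -0.5)]
ys = [1, 1, -1, -1]
theta, converged = perceptron(xs, ys)
```

The cap on the number of passes matters in practice: on data that is not linearly separable the loop would otherwise never terminate.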


A Bayesian view 


Another approach to online learning is to adopt a Bayesian view. This is conceptually quite 
simple: we just apply Bayes rule recursively: 


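$$p(\boldsymbol{\theta} \mid \mathcal{D}_{1:k}) \propto p(\mathcal{D}_k \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid \mathcal{D}_{1:k-1}) \tag{8.93}$$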


This has the obvious advantage of returning a posterior instead of just a point estimate. It also
allows for the online adaptation of hyper-parameters, which is important since cross-validation
cannot be used in an online setting. Finally, it has the (less obvious) advantage that it can be
quicker than SGD. To see why, note that by modeling the posterior variance of each parameter
in addition to its mean, we effectively associate a different learning rate with each parameter
(de Freitas et al. 2000), which is a simple way to model the curvature of the space. These
variances can then be adapted using the usual rules of probability theory. By contrast, getting
second-order optimization methods to work online is trickier (see e.g., (Schraudolph et al.
2007; Sunehag et al. 2009; Bordes et al. 2009, 2010)).

Algorithm 8.4: Perceptron algorithm

    Input: linearly separable data set $\mathbf{x}_i \in \mathbb{R}^D$, $y_i \in \{-1, +1\}$ for $i = 1:N$
    Initialize $\boldsymbol{\theta}_0$
    $k \leftarrow 0$
    repeat
        $k \leftarrow k + 1$
        $i \leftarrow k \bmod N$
        if $\hat{y}_i \neq y_i$ then
            $\boldsymbol{\theta}_{k+1} \leftarrow \boldsymbol{\theta}_k + y_i \mathbf{x}_i$
        else
            no-op
    until converged

As a simple example, in Section 18.2.3 we show how to use the Kalman filter to fit a linear 
regression model online. Unlike the LMS algorithm, this converges to the optimal (offline) answer 
in a single pass over the data. An extension which can learn a robust non-linear regression model 
in an online fashion is described in (Ting et al. 2010). For the GLM case, we can use an assumed 
density filter (Section 18.5.3), where we approximate the posterior by a Gaussian with a diagonal 
covariance; the variance terms serve as a per-parameter step-size. See Section 18.5.3.2 for details. 
Another approach is to use particle filtering (Section 23.5); this was used in (Andrieu et al. 2000) 
for sequentially learning a kernelized linear/logistic regression model. 
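
The recursion in Equation 8.93 is easiest to see in a conjugate model, where each update stays in closed form. The sketch below uses a Bernoulli likelihood with a Beta belief, which is an example chosen here for illustration (it is not one of the models discussed in this section); the function name `update` and the data are hypothetical.

```python
def update(a, b, x):
    """One step of recursive Bayes for a Bernoulli likelihood with a Beta(a, b) belief."""
    return a + x, b + (1 - x)

data = [1, 0, 1, 1, 0, 1]
a, b = 1.0, 1.0          # uniform Beta(1, 1) prior
for x in data:           # the posterior after step k is the prior for step k + 1
    a, b = update(a, b, x)

# The batch posterior depends only on the counts, and matches the online result.
a_batch = 1.0 + sum(data)
b_batch = 1.0 + len(data) - sum(data)
posterior_mean = a / (a + b)
```

Processing the observations one at a time gives the same posterior as processing them all at once, which is the sense in which the Bayesian update needs only a single pass over the data.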


Generative vs discriminative classifiers 


In Section 4.2.2, we showed that the posterior over class labels induced by Gaussian discriminant
analysis (GDA) has exactly the same form as logistic regression, namely $p(y = 1 \mid \mathbf{x}) = \mathrm{sigm}(\mathbf{w}^T \mathbf{x})$.
The decision boundary is therefore a linear function of $\mathbf{x}$ in both cases. Note,
however, that many generative models can give rise to a logistic regression posterior, e.g., if each
class-conditional density is Poisson, $p(x \mid y = c) = \mathrm{Poi}(x \mid \lambda_c)$. So the assumptions made by GDA
are much stronger than the assumptions made by logistic regression.
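
To see why Poisson class-conditional densities lead to a logistic posterior, consider a short derivation for a scalar count $x$ and two classes. The class priors $\pi_c = p(y = c)$ and the class labels 0 and 1 are notation assumed here for illustration.

```latex
\begin{align*}
\log \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)}
  &= \log \frac{\pi_1 \, \mathrm{Poi}(x \mid \lambda_1)}{\pi_0 \, \mathrm{Poi}(x \mid \lambda_0)} \\
  &= \log \frac{\pi_1}{\pi_0} + x \log \frac{\lambda_1}{\lambda_0} - (\lambda_1 - \lambda_0)
\end{align*}
```

The $x!$ terms cancel, so the log odds are linear in $x$, and hence $p(y = 1 \mid x) = \mathrm{sigm}(w x + b)$ with $w = \log(\lambda_1 / \lambda_0)$ and $b = \log(\pi_1 / \pi_0) - (\lambda_1 - \lambda_0)$.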

A further difference between these models is the way they are trained. When fitting a discriminative
model, we usually maximize the conditional log likelihood $\sum_{i=1}^N \log p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta})$, whereas
when fitting a generative model, we usually maximize the joint log likelihood, $\sum_{i=1}^N \log p(y_i, \mathbf{x}_i \mid \boldsymbol{\theta})$.
It is clear that these can, in general, give different results (see Exercise 4.20).

When the Gaussian assumptions made by GDA are correct, the model will need less training
data than logistic regression to achieve a certain level of performance, but if the Gaussian
assumptions are incorrect, logistic regression will do better (Ng and Jordan 2002). This is
because discriminative models do not need to model the distribution of the features. This is
illustrated in Figure 8.10. We see that the class conditional densities are rather complex; in
particular, $p(x \mid y = 1)$ is a multimodal distribution, which might be hard to estimate. However,
the class posterior, p(y = c|x), is a simple sigmoidal function, centered on the threshold value 
of 0.55. This suggests that, in general, discriminative methods will be more accurate, since their 
“job” is in some sense easier. However, accuracy is not the only important factor when choosing 
a method. Below we discuss some other advantages and disadvantages of each approach. 


Pros and cons of each approach 


• Easy to fit? As we have seen, it is usually very easy to fit generative classifiers. For example,
in Sections 3.5.1.1 and 4.2.4, we show that we can fit a naive Bayes model and an LDA model
by simple counting and averaging. By contrast, logistic regression requires solving a convex
optimization problem (see Section 8.3.4 for the details), which is much slower.

• Fit classes separately? In a generative classifier, we estimate the parameters of each class
conditional density independently, so we do not have to retrain the model when we add
more classes. In contrast, in discriminative models, all the parameters interact, so the whole
model must be retrained if we add a new class. (This is also the case if we train a generative
model to maximize a discriminative objective (Salojarvi et al. 2005).)

• Handle missing features easily? Sometimes some of the inputs (components of $\mathbf{x}$) are not
observed. In a generative classifier, there is a simple method for dealing with this, as we
discuss in Section 8.6.2. However, in a discriminative classifier, there is no principled solution
to this problem, since the model assumes that $\mathbf{x}$ is always available to be conditioned on
(although see (Marlin 2008) for some heuristic approaches).

• Can handle unlabeled training data? There is much interest in semi-supervised learning,
which uses unlabeled data to help solve a supervised task. This is fairly easy to do using
generative models (see e.g., (Lasserre et al. 2006; Liang et al. 2007)), but is much harder to do
with discriminative models.

• Symmetric in inputs and outputs? We can run a generative model “backwards”, and
infer probable inputs given the output by computing $p(\mathbf{x} \mid y)$. This is not possible with a
discriminative model. The reason is that a generative model defines a joint distribution on $\mathbf{x}$
and $y$, and hence treats both inputs and outputs symmetrically.

• Can handle feature preprocessing? A big advantage of discriminative methods is that they
allow us to preprocess the input in arbitrary ways, e.g., we can replace $\mathbf{x}$ with $\boldsymbol{\phi}(\mathbf{x})$, which
could be some basis function expansion, as illustrated in Figure 8.9. It is often hard to
define a generative model on such pre-processed data, since the new features are correlated
in complex ways.

• Well-calibrated probabilities? Some generative models, such as naive Bayes, make strong
independence assumptions which are often not valid. This can result in very extreme posterior
class probabilities (very near 0 or 1). Discriminative models, such as logistic regression,
are usually better calibrated in terms of their probability estimates.


We see that there are arguments for and against both kinds of models. It is therefore useful
to have both kinds in your “toolbox”. See Table 8.1 for a summary of the classification and
regression techniques we cover in this book.

Figure 8.9 (a) Multinomial logistic regression for 5 classes in the original feature space. (b) After basis
function expansion, using RBF kernels with a bandwidth of 1, and using all the data points as centers.
Figure generated by logregMultinomKernelDemo.

Figure 8.10 The class-conditional densities $p(x \mid y = c)$ (left) and the class posteriors $p(y = c \mid x)$
(right). Based on Figure 1.27 of (Bishop 2006a). Figure generated by generativeVsDiscrim.


Dealing with missing data 

Sometimes some of the inputs (components of $\mathbf{x}$) are not observed; this could be due to a
sensor failure, or a failure to complete an entry in a survey, etc. This is called the missing data
problem (Little and Rubin 1987). The ability to handle missing data in a principled way is one
of the biggest advantages of generative models.

To formalize our assumptions, we can associate a binary response variable $r_i \in \{0, 1\}$
that specifies whether each value $\mathbf{x}_i$ is observed or not. The joint model has the form
$p(\mathbf{x}_i, r_i \mid \boldsymbol{\theta}, \boldsymbol{\phi}) = p(r_i \mid \mathbf{x}_i, \boldsymbol{\phi}) \, p(\mathbf{x}_i \mid \boldsymbol{\theta})$, where $\boldsymbol{\phi}$ are the parameters controlling whether the item
is observed or not. If we assume $p(r_i \mid \mathbf{x}_i, \boldsymbol{\phi}) = p(r_i \mid \boldsymbol{\phi})$, we say the data is missing completely
at random or MCAR. If we assume $p(r_i \mid \mathbf{x}_i, \boldsymbol{\phi}) = p(r_i \mid \mathbf{x}_i^o, \boldsymbol{\phi})$, where $\mathbf{x}_i^o$ is the observed part of
$\mathbf{x}_i$, we say the data is missing at random or MAR. If neither of these assumptions holds, we say
the data is not missing at random or NMAR. In this case, we have to model the missing data
mechanism, since the pattern of missingness is informative about the values of the missing data
and the corresponding parameters. This is the case in most collaborative filtering problems, for
example. See e.g., (Marlin 2008) for further discussion. We will henceforth assume the data is
MAR.

Model | Classif/regr | Gen/Discr | Param/Non | Section
------|--------------|-----------|-----------|--------
Discriminant analysis | Classif | Gen | Param | Sec. 4.2.2, 4.2.4
Naive Bayes classifier | Classif | Gen | Param | Sec. 3.5, 3.5.1.2
Tree-augmented Naive Bayes classifier | Classif | Gen | Param | Sec. 10.2.1
Linear regression | Regr | Discrim | Param | Sec. 1.4.5, 7.3, 7.6
Logistic regression | Classif | Discrim | Param | Sec. 1.4.6, 8.3.4, 8.4.3, 21.8.1.1
Sparse linear/logistic regression | Both | Discrim | Param | Ch. 13
Mixture of experts | Both | Discrim | Param | Sec. 11.2.4
Multilayer perceptron (MLP)/Neural network | Both | Discrim | Param | Ch. 16
Conditional random field (CRF) | Classif | Discrim | Param | Sec. 19.6
K nearest neighbor classifier | Classif | Gen | Non | Sec. 1.4.2, 14.7.3
(Infinite) Mixture Discriminant analysis | Classif | Gen | Non | Sec. 14.7.3
Classification and regression trees (CART) | Both | Discrim | Non | Sec. 16.2
Boosted model | Both | Discrim | Non | Sec. 16.4
Sparse kernelized lin/logreg (SKLR) | Both | Discrim | Non | Sec. 14.3.2
Relevance vector machine (RVM) | Both | Discrim | Non | Sec. 14.3.2
Support vector machine (SVM) | Both | Discrim | Non | Sec. 14.5
Gaussian processes (GP) | Both | Discrim | Non | Ch. 15
Smoothing splines | Regr | Discrim | Non | Sec. 15.4.6

Table 8.1 List of various models for classification and regression which we discuss in this book. Columns
are as follows: Model name; is the model suitable for classification, regression, or both; is the model
generative or discriminative; is the model parametric or non-parametric; list of sections in book which
discuss the model. See also http://pmtk3.googlecode.com/svn/trunk/docs/tutorial/html/tutSupervised.html
for the PMTK equivalents of these models. Any generative probabilistic model (e.g.,
HMMs, Boltzmann machines, Bayesian networks, etc.) can be turned into a classifier by using it as a class
conditional density.

When dealing with missing data, it is helpful to distinguish the cases when there is missingness
only at test time (so the training data is complete data), from the harder case when there
is missingness also at training time. We will discuss these two cases below. Note that the class
label is always missing at test time, by definition; if the class label is also sometimes missing at
training time, the problem is called semi-supervised learning.


Missing data at test time


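In a generative classifier, we can handle features that are MAR by marginalizing them out. For
example, if we are missing the value of $x_1$, we can compute

$$p(y = c \mid \mathbf{x}_{2:D}, \boldsymbol{\theta}) \propto p(y = c \mid \boldsymbol{\theta}) \, p(\mathbf{x}_{2:D} \mid y = c, \boldsymbol{\theta}) \tag{8.94}$$

$$= p(y = c \mid \boldsymbol{\theta}) \sum_{x_1} p(x_1, \mathbf{x}_{2:D} \mid y = c, \boldsymbol{\theta}) \tag{8.95}$$

If we make the naive Bayes assumption, the marginalization can be performed as follows:

$$\sum_{x_1} p(x_1, \mathbf{x}_{2:D} \mid y = c, \boldsymbol{\theta}) = \left[ \sum_{x_1} p(x_1 \mid \boldsymbol{\theta}_{1c}) \right] \prod_{j=2}^{D} p(x_j \mid \boldsymbol{\theta}_{jc}) = \prod_{j=2}^{D} p(x_j \mid \boldsymbol{\theta}_{jc}) \tag{8.96}$$

where we exploited the fact that $\sum_{x_1} p(x_1 \mid y = c, \boldsymbol{\theta}) = 1$. Hence in a naive Bayes classifier, we
can simply ignore missing features at test time. Similarly, in discriminant analysis, no matter
what regularization method was used to estimate the parameters, we can always analytically
marginalize out the missing variables (see Section 4.3):

$$p(\mathbf{x}_{2:D} \mid y = c, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{x}_{2:D} \mid \boldsymbol{\mu}_{c,2:D}, \boldsymbol{\Sigma}_{c,2:D,2:D}) \tag{8.97}$$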
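
The claim that a naive Bayes classifier can simply skip missing features (Equation 8.96) can be checked numerically. The sketch below assumes binary features with Bernoulli class-conditionals; the function names, the use of `None` to mark a missing value, and the parameter values are all hypothetical choices for this illustration.

```python
import math

def log_scores(x, log_prior, theta):
    """Unnormalized log p(y=c, observed x) for binary naive Bayes.

    x[j] is 0, 1, or None (missing); theta[c][j] = p(x_j = 1 | y = c).
    Missing features are skipped, which is the marginalization in Equation 8.96.
    """
    scores = []
    for c, lp in enumerate(log_prior):
        s = lp
        for j, xj in enumerate(x):
            if xj is None:
                continue  # the sum over x_j of p(x_j | y = c) equals 1
            p = theta[c][j]
            s += math.log(p if xj == 1 else 1.0 - p)
        scores.append(s)
    return scores

def normalize(scores):
    m = max(scores)                         # subtract the max for stability
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [wi / z for wi in w]

# Hypothetical parameters for 2 classes and 3 binary features.
log_prior = [math.log(0.5), math.log(0.5)]
theta = [[0.9, 0.2, 0.7], [0.3, 0.6, 0.1]]

post_skip = normalize(log_scores([None, 1, 0], log_prior, theta))

# Brute-force check: explicitly sum the joint over both values of the missing x_1.
joint = [sum(math.exp(s) for s in pair) for pair in zip(
    log_scores([0, 1, 0], log_prior, theta),
    log_scores([1, 1, 0], log_prior, theta))]
post_sum = [j / sum(joint) for j in joint]
```

Skipping the missing feature and summing it out explicitly give the same class posterior, as Equation 8.96 predicts.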


Missing data at training time 


Missing data at training time is harder to deal with. In particular, computing the MLE or MAP 
estimate is no longer a simple optimization problem, for reasons discussed in Section 11.3.2. 
However, we will soon study a variety of more sophisticated algorithms (such as the EM algorithm,
in Section 11.4) for finding approximate ML or MAP estimates in such cases.


Fisher’s linear discriminant analysis (FLDA) * 


Discriminant analysis is a generative approach to classification, which requires fitting an MVN to 
the features. As we have discussed, this can be problematic in high dimensions. An alternative 
approach is to reduce the dimensionality of the features $\mathbf{x} \in \mathbb{R}^D$ and then fit an MVN to the
resulting low-dimensional features $\mathbf{z} \in \mathbb{R}^L$. The simplest approach is to use a linear projection
matrix, $\mathbf{z} = \mathbf{W}\mathbf{x}$, where $\mathbf{W}$ is an $L \times D$ matrix. One approach to finding $\mathbf{W}$ would be to use
PCA (Section 12.2); the result would be very similar to RDA (Section 4.2.6), since SVD and PCA 
are essentially equivalent. However, PCA is an unsupervised technique that does not take class 
labels into account. Thus the resulting low dimensional features are not necessarily optimal 
for classification, as illustrated in Figure 8.11. An alternative approach is to find the matrix 
W such that the low-dimensional data can be classified as well as possible using a Gaussian 
class-conditional density model. The assumption of Gaussianity is reasonable since we are 
computing linear combinations of (potentially non-Gaussian) features. This approach is called 
Fisher’s linear discriminant analysis, or FLDA. 

FLDA is an interesting hybrid of discriminative and generative techniques. The drawback of
this technique is that it is restricted to using $L \le C - 1$ dimensions, regardless of $D$, for reasons
that we will explain below. In the two-class case, this means we are seeking a single vector $\mathbf{w}$
onto which we can project the data. Below we derive the optimal $\mathbf{w}$ in the two-class case. We
then generalize to the multi-class case, and finally we give a probabilistic interpretation of this
technique.

Figure 8.11 Example of Fisher's linear discriminant. (a) Two class data in 2D. Dashed green line = first
principal basis vector. Dotted red line = Fisher's linear discriminant vector. Solid black line joins the
class-conditional means. (b) Projection of points onto Fisher’s vector shows good class separation. (c)
Projection of points onto PCA vector shows poor class separation. Figure generated by fisherLDAdemo.


Derivation of the optimal 1d projection

We now derive this optimal direction $\mathbf{w}$, for the two-class case, following the presentation of
(Bishop 2006b, Sec 4.1.4). Define the class-conditional means as


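$$\boldsymbol{\mu}_1 = \frac{1}{N_1} \sum_{i: y_i = 1} \mathbf{x}_i, \quad \boldsymbol{\mu}_2 = \frac{1}{N_2} \sum_{i: y_i = 2} \mathbf{x}_i \tag{8.98}$$

Let $m_k = \mathbf{w}^T \boldsymbol{\mu}_k$ be the projection of each mean onto the line $\mathbf{w}$. Also, let $z_i = \mathbf{w}^T \mathbf{x}_i$ be the
projection of the data onto the line. The variance of the projected points is proportional to

$$s_k^2 = \sum_{i: y_i = k} (z_i - m_k)^2 \tag{8.99}$$

The goal is to find $\mathbf{w}$ such that we maximize the distance between the means, $m_2 - m_1$, while
also ensuring the projected clusters are “tight”:

$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} \tag{8.100}$$

We can rewrite the right hand side of the above in terms of $\mathbf{w}$ as follows

$$J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}} \tag{8.101}$$

where $\mathbf{S}_B$ is the between-class scatter matrix given by

$$\mathbf{S}_B = (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T \tag{8.102}$$

and $\mathbf{S}_W$ is the within-class scatter matrix, given by

$$\mathbf{S}_W = \sum_{i: y_i = 1} (\mathbf{x}_i - \boldsymbol{\mu}_1)(\mathbf{x}_i - \boldsymbol{\mu}_1)^T + \sum_{i: y_i = 2} (\mathbf{x}_i - \boldsymbol{\mu}_2)(\mathbf{x}_i - \boldsymbol{\mu}_2)^T \tag{8.103}$$

To see this, note that

$$\mathbf{w}^T \mathbf{S}_B \mathbf{w} = \mathbf{w}^T (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T \mathbf{w} = (m_2 - m_1)(m_2 - m_1) \tag{8.104}$$

and

$$\mathbf{w}^T \mathbf{S}_W \mathbf{w} = \sum_{i: y_i = 1} \mathbf{w}^T (\mathbf{x}_i - \boldsymbol{\mu}_1)(\mathbf{x}_i - \boldsymbol{\mu}_1)^T \mathbf{w} + \sum_{i: y_i = 2} \mathbf{w}^T (\mathbf{x}_i - \boldsymbol{\mu}_2)(\mathbf{x}_i - \boldsymbol{\mu}_2)^T \mathbf{w} \tag{8.105}$$

$$= \sum_{i: y_i = 1} (z_i - m_1)^2 + \sum_{i: y_i = 2} (z_i - m_2)^2 \tag{8.106}$$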


Equation 8.101 is a ratio of two scalars; we can take its derivative with respect to $\mathbf{w}$ and equate
to zero. One can show (Exercise 12.6) that $J(\mathbf{w})$ is maximized when

$$\mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w} \tag{8.107}$$

where

$$\lambda = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}} \tag{8.108}$$

Figure 8.12 (a) PCA projection of vowel data to 2d. (b) FLDA projection of vowel data to 2d. We see there
is better class separation in the FLDA case. Based on Figure 4.11 of (Hastie et al. 2009). Figure generated by
fisherDiscrimVowelDemo, by Hannes Bretschneider.


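Equation 8.107 is called a generalized eigenvalue problem. If $\mathbf{S}_W$ is invertible, we can convert
it to a regular eigenvalue problem:

$$\mathbf{S}_W^{-1} \mathbf{S}_B \mathbf{w} = \lambda \mathbf{w} \tag{8.109}$$

However, in the two class case, there is a simpler solution. In particular, since

$$\mathbf{S}_B \mathbf{w} = (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T \mathbf{w} = (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)(m_2 - m_1) \tag{8.110}$$

then, from Equation 8.109 we have

$$\lambda \mathbf{w} = \mathbf{S}_W^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)(m_2 - m_1) \tag{8.111}$$

$$\mathbf{w} \propto \mathbf{S}_W^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) \tag{8.112}$$

Since we only care about the directionality, and not the scale factor, we can just set

$$\mathbf{w} = \mathbf{S}_W^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) \tag{8.113}$$

This is the optimal solution in the two-class case. If $\mathbf{S}_W \propto \mathbf{I}$, meaning the pooled covariance
matrix is isotropic, then $\mathbf{w}$ is proportional to the vector that joins the class means. This is an
intuitively reasonable direction to project onto, as shown in Figure 8.11.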
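
The closed-form direction in Equation 8.113 can be computed by hand for 2d data, where the inverse of the scatter matrix has a simple formula. The sketch below is an illustration under assumptions: the function name `fisher_direction` and the toy clusters are hypothetical, and the code assumes the within-class scatter matrix is invertible.

```python
def fisher_direction(class1, class2):
    """Return w = S_W^{-1} (mu_2 - mu_1) for 2d points (Equation 8.113)."""
    def mean(pts):
        n = len(pts)
        return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

    mu1, mu2 = mean(class1), mean(class2)

    # Within-class scatter S_W = [[a, b], [b, d]], summed over both classes.
    a = b = d = 0.0
    for pts, mu in ((class1, mu1), (class2, mu2)):
        for x, y in pts:
            dx, dy = x - mu[0], y - mu[1]
            a += dx * dx
            b += dx * dy
            d += dy * dy

    det = a * d - b * b  # assumes S_W is invertible (det != 0)
    g0, g1 = mu2[0] - mu1[0], mu2[1] - mu1[1]
    # Closed-form 2x2 inverse applied to (mu_2 - mu_1).
    return ((d * g0 - b * g1) / det, (-b * g0 + a * g1) / det)

# Hypothetical elongated clusters: large spread along x, class offset along y.
class1 = [(-2.0, 0.0), (0.0, 0.2), (2.0, -0.2)]
class2 = [(-2.0, 1.0), (0.0, 1.2), (2.0, 0.8)]
w = fisher_direction(class1, class2)
```

In this toy data the two clusters are tilted, so the resulting direction is not exactly parallel to the line joining the means; the off-diagonal scatter term rotates it slightly.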


Extension to higher dimensions and multiple classes 


We can extend the above idea to multiple classes, and to higher dimensional subspaces, by 
finding a projection matrix W which maps from D to L so as to maximize 


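$$J(\mathbf{W}) = \frac{|\mathbf{W} \boldsymbol{\Sigma}_B \mathbf{W}^T|}{|\mathbf{W} \boldsymbol{\Sigma}_W \mathbf{W}^T|} \tag{8.114}$$

where

$$\boldsymbol{\Sigma}_B \triangleq \sum_c \frac{N_c}{N} (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^T \tag{8.115}$$

$$\boldsymbol{\Sigma}_W \triangleq \sum_c \frac{N_c}{N} \boldsymbol{\Sigma}_c \tag{8.116}$$

$$\boldsymbol{\Sigma}_c \triangleq \frac{1}{N_c} \sum_{i: y_i = c} (\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^T \tag{8.117}$$

The solution can be shown to be

$$\mathbf{W} = \boldsymbol{\Sigma}_W^{-1/2} \mathbf{U} \tag{8.118}$$

where $\mathbf{U}$ are the $L$ leading eigenvectors of $\boldsymbol{\Sigma}_W^{-1/2} \boldsymbol{\Sigma}_B \boldsymbol{\Sigma}_W^{-1/2}$, assuming $\boldsymbol{\Sigma}_W$ is non-singular. (If it
is singular, we can first perform PCA on all the data.)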

Figure 8.12 gives an example of this method applied to some D = 10 dimensional speech 
data, representing C = 11 different vowel sounds. We see that FLDA gives better class separation 
than PCA. 

Note that FLDA is restricted to finding at most a $L \le C - 1$ dimensional linear subspace,
no matter how large $D$, because the rank of the between class covariance matrix $\boldsymbol{\Sigma}_B$ is $C - 1$.
(The $-1$ term arises because of the $\boldsymbol{\mu}$ term, which is a linear function of the $\boldsymbol{\mu}_c$.) This is a rather
severe restriction which limits the usefulness of FLDA.


Probabilistic interpretation of FLDA * 


To find a valid probabilistic interpretation of FLDA, we follow the approach of (Kumar and Andreo 
1998; Zhou et al. 2009). They proposed a model known as heteroscedastic LDA (HLDA), which 
works as follows. Let $\mathbf{W}$ be a $D \times D$ invertible matrix, and let $\mathbf{z}_i = \mathbf{W}\mathbf{x}_i$ be a transformed
version of the data. We now fit full covariance Gaussians to the transformed data, one per class,
but with the constraint that only the first $L$ components will be class-specific; the remaining
$H = D - L$ components will be shared across classes, and will thus not be discriminative. That
is, we use


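$$p(\mathbf{z}_i \mid \boldsymbol{\theta}, y_i = c) = \mathcal{N}(\mathbf{z}_i \mid \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c) \tag{8.119}$$

$$\boldsymbol{\mu}_c \triangleq (\mathbf{m}_c; \mathbf{m}_0) \tag{8.120}$$

$$\boldsymbol{\Sigma}_c \triangleq \begin{pmatrix} \mathbf{S}_c & \mathbf{0} \\ \mathbf{0} & \mathbf{S}_0 \end{pmatrix} \tag{8.121}$$

where $\mathbf{m}_0$ is the shared $H$ dimensional mean and $\mathbf{S}_0$ is the shared $H \times H$ covariance. The pdf
of the original (untransformed) data is given by

$$p(\mathbf{x}_i \mid y_i = c, \mathbf{W}, \boldsymbol{\theta}) = |\mathbf{W}| \, \mathcal{N}(\mathbf{W}\mathbf{x}_i \mid \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c) \tag{8.122}$$

$$= |\mathbf{W}| \, \mathcal{N}(\mathbf{W}_L \mathbf{x}_i \mid \mathbf{m}_c, \mathbf{S}_c) \, \mathcal{N}(\mathbf{W}_H \mathbf{x}_i \mid \mathbf{m}_0, \mathbf{S}_0) \tag{8.123}$$

where $\mathbf{W} = \begin{pmatrix} \mathbf{W}_L \\ \mathbf{W}_H \end{pmatrix}$. For fixed $\mathbf{W}$, it is easy to derive the MLE for $\boldsymbol{\theta}$. One can then optimize
$\mathbf{W}$ using gradient methods.

In the special case that the $\boldsymbol{\Sigma}_c$ are diagonal, there is a closed-form solution for $\mathbf{W}$ (Gales
1999). And in the special case the $\boldsymbol{\Sigma}_c$ are all equal, we recover classical LDA (Zhou et al. 2009).

In view of this result, it should be clear that HLDA will outperform LDA if the class
covariances are not equal within the discriminative subspace (i.e., if the assumption that $\boldsymbol{\Sigma}_c$ is
independent of $c$ is a poor assumption). This is easy to demonstrate on synthetic data, and is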
also the case on more challenging tasks such as speech recognition (Kumar and Andreo 1998). 
Furthermore, we can extend the model by allowing each class to use its own projection matrix; 
this is known as multiple LDA (Gales 2002). 


Exercises 


Exercise 8.1 Spam classification using logistic regression 


Consider the email spam data set discussed on p300 of (Hastie et al. 2009). This consists of 4601 email 
messages, from which 57 features have been extracted. These are as follows: 


• 48 features, in [0, 100], giving the percentage of words in a given message which match a given word
on the list. The list contains words such as “business”, “free”, “george”, etc. (The data was collected by
George Forman, so his name occurs quite a lot.)

• 6 features, in [0, 100], giving the percentage of characters in the email that match a given character on
the list. The characters are ; ( [ ! $ #

• Feature 55: The average length of an uninterrupted sequence of capital letters (max is 40.3, mean is 4.9)

• Feature 56: The length of the longest uninterrupted sequence of capital letters (max is 45.0, mean is
52.6)

• Feature 57: The sum of the lengths of uninterrupted sequences of capital letters (max is 25.6, mean is
282.2)

Load the data from spamData.mat, which contains a training set (of size 3065) and a test set (of size
1536).

One can imagine performing several kinds of preprocessing to this data. Try each of the following
separately:

a. Standardize the columns so they all have mean 0 and unit variance.

b. Transform the features using $\log(x_{ij} + 0.1)$.

c. Binarize the features using $\mathbb{I}(x_{ij} > 0)$.

For each version of the data, fit a logistic regression model. Use cross validation to choose the strength
of the $\ell_2$ regularizer. Report the mean error rate on the training and test sets. You should get numbers
similar to this:

method | train | test
-------|-------|------
stnd   | 0.082 | 0.079
log    | 0.052 | 0.059
binary | 0.065 | 0.072


(The precise values will depend on what regularization value you choose.) Turn in your code and numerical 
results. 


(See also Exercise 8.2.)


Exercise 8.2 Spam classification using naive Bayes 


We will re-examine the dataset from Exercise 8.1.

a. Use naiveBayesFit and naiveBayesPredict on the binarized spam data. What is the training and
test error? (You can try different settings of the pseudocount $\alpha$ if you like (this corresponds to the
Beta($\alpha$, $\alpha$) prior on each $\theta_j$), although the default of $\alpha = 1$ is probably fine.) Turn in your error rates.


b. Modify the code so it can handle real-valued features. Use a Gaussian density for each feature; fit it 
with maximum likelihood. What are the training and test error rates on the standardized data and the 
log transformed data? Turn in your 4 error rates and code. 


Exercise 8.3 Gradient and Hessian of log-likelihood for logistic regression 


a. Let $\sigma(a) = \frac{1}{1 + e^{-a}}$ be the sigmoid function. Show that

$$\frac{d\sigma(a)}{da} = \sigma(a)(1 - \sigma(a)) \tag{8.124}$$


b. Using the previous result and the chain rule of calculus, derive an expression for the gradient of the 
log likelihood (Equation 8.5). 


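c. The Hessian can be written as $\mathbf{H} = \mathbf{X}^T \mathbf{S} \mathbf{X}$, where $\mathbf{S} \triangleq \mathrm{diag}(\mu_1(1 - \mu_1), \ldots, \mu_n(1 - \mu_n))$. Show
that $\mathbf{H}$ is positive definite. (You may assume that $0 < \mu_i < 1$, so the elements of $\mathbf{S}$ will be strictly
positive, and that $\mathbf{X}$ is full rank.)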


Exercise 8.4 Gradient and Hessian of log-likelihood for multinomial logistic regression 
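a. Let $\mu_{ik} = \mathcal{S}(\boldsymbol{\eta}_i)_k$. Prove that the Jacobian of the softmax is

$$\frac{\partial \mu_{ik}}{\partial \eta_{ij}} = \mu_{ik}(\delta_{kj} - \mu_{ij}) \tag{8.125}$$

where $\delta_{kj} = \mathbb{I}(k = j)$.

b. Hence show that

$$\nabla_{\mathbf{w}_c} \ell = \sum_i (y_{ic} - \mu_{ic}) \mathbf{x}_i \tag{8.126}$$

Hint: use the chain rule and the fact that $\sum_c y_{ic} = 1$.

c. Show that the block submatrix of the Hessian for classes $c$ and $c'$ is given by

$$\mathbf{H}_{c,c'} = -\sum_i \mu_{ic}(\delta_{c,c'} - \mu_{i,c'}) \mathbf{x}_i \mathbf{x}_i^T \tag{8.127}$$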


Exercise 8.5 Symmetric version of $\ell_2$ regularized multinomial logistic regression
(Source: Ex 18.3 of (Hastie et al. 2009).)

Multiclass logistic regression has the form

$$p(y = c \mid \mathbf{x}, \mathbf{W}) = \frac{\exp(w_{c0} + \mathbf{w}_c^T \mathbf{x})}{\sum_{k=1}^{C} \exp(w_{k0} + \mathbf{w}_k^T \mathbf{x})} \tag{8.128}$$

where $\mathbf{W}$ is a $(D + 1) \times C$ weight matrix. We can arbitrarily define $\mathbf{w}_c = \mathbf{0}$ for one of the classes, say
$c = C$, since $p(y = C \mid \mathbf{x}, \mathbf{W}) = 1 - \sum_{c=1}^{C-1} p(y = c \mid \mathbf{x}, \mathbf{W})$. In this case, the model has the form

$$p(y = c \mid \mathbf{x}, \mathbf{W}) = \frac{\exp(w_{c0} + \mathbf{w}_c^T \mathbf{x})}{1 + \sum_{k=1}^{C-1} \exp(w_{k0} + \mathbf{w}_k^T \mathbf{x})} \tag{8.129}$$

If we don't “clamp” one of the vectors to some constant value, the parameters will be unidentifiable.
However, suppose we don't clamp $\mathbf{w}_c = \mathbf{0}$, so we are using Equation 8.128, but we add $\ell_2$ regularization
by optimizing

$$\sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i, \mathbf{W}) - \lambda \sum_{c=1}^{C} \|\mathbf{w}_c\|_2^2 \tag{8.130}$$

Show that at the optimum we have $\sum_{c=1}^{C} \hat{w}_{cj} = 0$ for $j = 1:D$. (For the unregularized $\hat{w}_{c0}$ terms, we
still need to enforce that $w_{C0} = 0$ to ensure identifiability of the offset.)


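Exercise 8.6 Elementary properties of $\ell_2$ regularized logistic regression

(Source: Jaakkola.) Consider minimizing

$$J(\mathbf{w}) = -\ell(\mathbf{w}, \mathcal{D}_{\text{train}}) + \lambda \|\mathbf{w}\|_2^2 \tag{8.131}$$

where

$$\ell(\mathbf{w}, \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \log \sigma(y_i \mathbf{x}_i^T \mathbf{w}) \tag{8.132}$$

is the average log-likelihood on data set $\mathcal{D}$, for $y_i \in \{-1, +1\}$. Answer the following true/false questions.

a. $J(\mathbf{w})$ has multiple locally optimal solutions: T/F?

b. Let $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} J(\mathbf{w})$ be a global optimum. $\hat{\mathbf{w}}$ is sparse (has many zero entries): T/F?

c. If the training data is linearly separable, then some weights $w_j$ might become infinite if $\lambda = 0$: T/F?

d. $\ell(\hat{\mathbf{w}}, \mathcal{D}_{\text{train}})$ always increases as we increase $\lambda$: T/F?

e. $\ell(\hat{\mathbf{w}}, \mathcal{D}_{\text{test}})$ always increases as we increase $\lambda$: T/F?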


Exercise 8.7 Regularizing separate terms in 2d logistic regression 


(Source: Jaakkola.)

a. Consider the data in Figure 8.13, where we fit the model $p(y = 1 \mid \mathbf{x}, \mathbf{w}) = \sigma(w_0 + w_1 x_1 + w_2 x_2)$.
Suppose we fit the model by maximum likelihood, i.e., we minimize

$$J(\mathbf{w}) = -\ell(\mathbf{w}, \mathcal{D}_{\text{train}}) \tag{8.133}$$

where $\ell(\mathbf{w}, \mathcal{D}_{\text{train}})$ is the log likelihood on the training set. Sketch a possible decision boundary
corresponding to $\hat{\mathbf{w}}$. (Copy the figure first (a rough sketch is enough), and then superimpose your
answer on your copy, since you will need multiple versions of this figure.) Is your answer (decision
boundary) unique? How many classification errors does your method make on the training set?

b. Now suppose we regularize only the $w_0$ parameter, i.e., we minimize

$$J_0(\mathbf{w}) = -\ell(\mathbf{w}, \mathcal{D}_{\text{train}}) + \lambda w_0^2 \tag{8.134}$$

Suppose $\lambda$ is a very large number, so we regularize $w_0$ all the way to 0, but all other parameters are
unregularized. Sketch a possible decision boundary. How many classification errors does your method
make on the training set? Hint: consider the behavior of simple linear regression, $w_0 + w_1 x_1 + w_2 x_2$
when $x_1 = x_2 = 0$.

c. Now suppose we heavily regularize only the $w_1$ parameter, i.e., we minimize

$$J_1(\mathbf{w}) = -\ell(\mathbf{w}, \mathcal{D}_{\text{train}}) + \lambda w_1^2 \tag{8.135}$$

Sketch a possible decision boundary. How many classification errors does your method make on the
training set?

d. Now suppose we heavily regularize only the $w_2$ parameter. Sketch a possible decision boundary. How
many classification errors does your method make on the training set?

Figure 8.13 Data for logistic regression question.


Generalized linear models and the exponential family


Introduction 


We have now encountered a wide variety of probability distributions: the Gaussian, the Bernoulli,
the Student t, the uniform, the gamma, etc. It turns out that most of these are members of a
broader class of distributions known as the exponential family.¹ In this chapter, we discuss
various properties of this family. This allows us to derive theorems and algorithms with very 
broad applicability. 

We will see how we can easily use any member of the exponential family as a class-conditional 
density in order to make a generative classifier. In addition, we will discuss how to build 
discriminative models, where the response variable has an exponential family distribution, whose 
mean is a linear function of the inputs; this is known as a generalized linear model, and 
generalizes the idea of logistic regression to other kinds of response variables. 


The exponential family 
Before defining the exponential family, we mention several reasons why it is important: 


• It can be shown that, under certain regularity conditions, the exponential family is the only
family of distributions with finite-sized sufficient statistics, meaning that we can compress
the data into a fixed-sized summary without loss of information. This is particularly useful
for online learning, as we will see later.

• The exponential family is the only family of distributions for which conjugate priors exist,
which simplifies the computation of the posterior (see Section 9.2.5).

• The exponential family can be shown to be the family of distributions that makes the least
set of assumptions subject to some user-chosen constraints (see Section 9.2.6).

• The exponential family is at the core of generalized linear models, as discussed in Section 9.3.

• The exponential family is at the core of variational inference, as discussed in Section 21.2.


1. The exceptions are the Student t, which does not have the right form, and the uniform distribution, which does not 
have fixed support independent of the parameter values. 


9.2.1 Definition


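A pdf or pmf $p(\mathbf{x}|\boldsymbol{\theta})$, for $\mathbf{x} = (x_1, \ldots, x_m) \in \mathcal{X}^m$ and $\boldsymbol{\theta} \in \Theta \subseteq \mathbb{R}^d$, is said to be in the
exponential family if it is of the form

\begin{align}
p(\mathbf{x}|\boldsymbol{\theta}) &= \frac{1}{Z(\boldsymbol{\theta})} h(\mathbf{x}) \exp[\boldsymbol{\theta}^T \boldsymbol{\phi}(\mathbf{x})] \tag{9.1} \\
&= h(\mathbf{x}) \exp[\boldsymbol{\theta}^T \boldsymbol{\phi}(\mathbf{x}) - A(\boldsymbol{\theta})] \tag{9.2}
\end{align}

where

\begin{align}
Z(\boldsymbol{\theta}) &= \int_{\mathcal{X}^m} h(\mathbf{x}) \exp[\boldsymbol{\theta}^T \boldsymbol{\phi}(\mathbf{x})] \, d\mathbf{x} \tag{9.3} \\
A(\boldsymbol{\theta}) &= \log Z(\boldsymbol{\theta}) \tag{9.4}
\end{align}

Here $\boldsymbol{\theta}$ are called the natural parameters or canonical parameters, $\boldsymbol{\phi}(\mathbf{x}) \in \mathbb{R}^d$ is called a
vector of sufficient statistics, $Z(\boldsymbol{\theta})$ is called the partition function, $A(\boldsymbol{\theta})$ is called the log
partition function or cumulant function, and $h(\mathbf{x})$ is a scaling constant, often 1. If
$\boldsymbol{\phi}(\mathbf{x}) = \mathbf{x}$, we say it is a natural exponential family.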

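Equation 9.2 can be generalized by writing

$$p(\mathbf{x}|\boldsymbol{\theta}) = h(\mathbf{x}) \exp[\boldsymbol{\eta}(\boldsymbol{\theta})^T \boldsymbol{\phi}(\mathbf{x}) - A(\boldsymbol{\eta}(\boldsymbol{\theta}))] \tag{9.5}$$

where $\boldsymbol{\eta}$ is a function that maps the parameters $\boldsymbol{\theta}$ to the canonical parameters $\boldsymbol{\eta} = \boldsymbol{\eta}(\boldsymbol{\theta})$. If
$\dim(\boldsymbol{\theta}) < \dim(\boldsymbol{\eta}(\boldsymbol{\theta}))$, it is called a curved exponential family, which means we have more
sufficient statistics than parameters. If $\boldsymbol{\eta}(\boldsymbol{\theta}) = \boldsymbol{\theta}$, the model is said to be in canonical form.
We will assume models are in canonical form unless we state otherwise.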


9.2.2 Examples


Let us consider some examples to make things clearer. 


9.2.2.1 Bernoulli

The Bernoulli for $x \in \{0, 1\}$ can be written in exponential family form as follows:

$$\mathrm{Ber}(x|\mu) = \mu^x (1-\mu)^{1-x} = \exp[x \log(\mu) + (1-x) \log(1-\mu)] = \exp[\boldsymbol{\phi}(x)^T \boldsymbol{\theta}] \tag{9.6}$$

where $\boldsymbol{\phi}(x) = [I(x=1), I(x=0)]$ and $\boldsymbol{\theta} = [\log(\mu), \log(1-\mu)]$. However, this representation
is over-complete since there is a linear dependence between the features:

$$\mathbf{1}^T \boldsymbol{\phi}(x) = I(x=0) + I(x=1) = 1 \tag{9.7}$$


Consequently $\boldsymbol{\theta}$ is not uniquely identifiable. It is common to require that the representation be
minimal, which means there is a unique $\boldsymbol{\theta}$ associated with the distribution. In this case, we
can just define

$$\mathrm{Ber}(x|\mu) = (1-\mu) \exp\left[x \log\left(\frac{\mu}{1-\mu}\right)\right] \tag{9.8}$$

Now we have $\phi(x) = x$, $\theta = \log\left(\frac{\mu}{1-\mu}\right)$, which is the log-odds ratio, and $Z = 1/(1-\mu)$. We
can recover the mean parameter $\mu$ from the canonical parameter using

$$\mu = \mathrm{sigm}(\theta) = \frac{1}{1+e^{-\theta}} \tag{9.9}$$
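
As a quick numerical check of the minimal representation, the following sketch converts between the mean parameter and the natural parameter and evaluates the pmf in exponential family form. The function names and the value 0.3 are illustrative choices, not part of the text.

```python
import math

def sigm(theta: float) -> float:
    """Logistic function: maps the natural parameter to the mean."""
    return 1.0 / (1.0 + math.exp(-theta))

def logit(mu: float) -> float:
    """Log-odds: maps the mean parameter to the natural parameter."""
    return math.log(mu / (1.0 - mu))

def bernoulli_expfam(x: int, theta: float) -> float:
    """Minimal exponential family form: exp(theta * x - A(theta))."""
    log_partition = math.log(1.0 + math.exp(theta))  # A(theta) = log Z
    return math.exp(theta * x - log_partition)

mu = 0.3                      # illustrative mean parameter
theta = logit(mu)
pmf = [bernoulli_expfam(x, theta) for x in (0, 1)]
```

Here `pmf` should reproduce the usual probabilities $1-\mu$ and $\mu$, and `sigm(theta)` should return the original mean.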


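9.2.2.2 Multinoulli


We can represent the multinoulli as a minimal exponential family as follows (where $x_k = I(x = k)$):

\begin{align}
\mathrm{Cat}(x|\boldsymbol{\mu}) &= \prod_{k=1}^K \mu_k^{x_k} = \exp\left[\sum_{k=1}^K x_k \log \mu_k\right] \tag{9.10} \\
&= \exp\left[\sum_{k=1}^{K-1} x_k \log \mu_k + \left(1 - \sum_{k=1}^{K-1} x_k\right) \log\left(1 - \sum_{k=1}^{K-1} \mu_k\right)\right] \tag{9.11} \\
&= \exp\left[\sum_{k=1}^{K-1} x_k \log\left(\frac{\mu_k}{1 - \sum_{j=1}^{K-1} \mu_j}\right) + \log\left(1 - \sum_{k=1}^{K-1} \mu_k\right)\right] \tag{9.12} \\
&= \exp\left[\sum_{k=1}^{K-1} x_k \log\left(\frac{\mu_k}{\mu_K}\right) + \log \mu_K\right] \tag{9.13}
\end{align}

where $\mu_K = 1 - \sum_{k=1}^{K-1} \mu_k$. We can write this in exponential family form as follows:

\begin{align}
\mathrm{Cat}(x|\boldsymbol{\theta}) &= \exp(\boldsymbol{\theta}^T \boldsymbol{\phi}(x) - A(\boldsymbol{\theta})) \tag{9.14} \\
\boldsymbol{\theta} &= \left[\log\frac{\mu_1}{\mu_K}, \ldots, \log\frac{\mu_{K-1}}{\mu_K}\right] \tag{9.15} \\
\boldsymbol{\phi}(x) &= [I(x=1), \ldots, I(x=K-1)] \tag{9.16}
\end{align}

We can recover the mean parameters from the canonical parameters using

$$\mu_k = \frac{e^{\theta_k}}{1 + \sum_{j=1}^{K-1} e^{\theta_j}} \tag{9.17}$$

From this, we find

$$\mu_K = 1 - \frac{\sum_{j=1}^{K-1} e^{\theta_j}}{1 + \sum_{j=1}^{K-1} e^{\theta_j}} = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\theta_j}} \tag{9.18}$$

and hence

$$A(\boldsymbol{\theta}) = \log\left(1 + \sum_{k=1}^{K-1} e^{\theta_k}\right) \tag{9.19}$$

If we define $\theta_K = 0$, we can write $\boldsymbol{\mu} = \mathcal{S}(\boldsymbol{\theta})$ and $A(\boldsymbol{\theta}) = \log \sum_{k=1}^K e^{\theta_k}$, where $\mathcal{S}$ is the
softmax function in Equation 4.39.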
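
The round trip between mean and canonical parameters can be sketched as follows. This is a minimal illustration under assumed names (`to_natural`, `to_mean`) and an arbitrary three-class example; it relies only on Equations 9.15 and 9.17 to 9.19.

```python
import math

def to_natural(mu):
    """theta_k = log(mu_k / mu_K) for k = 1..K-1 (Equation 9.15)."""
    return [math.log(m / mu[-1]) for m in mu[:-1]]

def to_mean(theta):
    """Invert via Equations 9.17-9.18 (softmax with theta_K fixed at 0)."""
    denom = 1.0 + sum(math.exp(t) for t in theta)
    return [math.exp(t) / denom for t in theta] + [1.0 / denom]

def log_partition(theta):
    """A(theta) from Equation 9.19."""
    return math.log(1.0 + sum(math.exp(t) for t in theta))

mu = [0.2, 0.5, 0.3]          # illustrative class probabilities
theta = to_natural(mu)        # K - 1 = 2 free parameters
mu_back = to_mean(theta)
```

Note that `theta` has only K-1 entries, which is what makes the representation minimal, and that `log_partition(theta)` equals $-\log \mu_K$.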


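9.2.2.3 Univariate Gaussian


The univariate Gaussian can be written in exponential family form as follows:

\begin{align}
\mathcal{N}(x|\mu, \sigma^2) &= \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right] \tag{9.20} \\
&= \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left[-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}\mu^2\right] \tag{9.21} \\
&= \frac{1}{Z(\boldsymbol{\theta})} \exp(\boldsymbol{\theta}^T \boldsymbol{\phi}(x)) \tag{9.22}
\end{align}

where

\begin{align}
\boldsymbol{\theta} &= \begin{pmatrix} \mu/\sigma^2 \\ -\frac{1}{2\sigma^2} \end{pmatrix} \tag{9.23} \\
\boldsymbol{\phi}(x) &= \begin{pmatrix} x \\ x^2 \end{pmatrix} \tag{9.24} \\
Z(\mu, \sigma^2) &= \sqrt{2\pi}\,\sigma \exp\left[\frac{\mu^2}{2\sigma^2}\right] \tag{9.25} \\
A(\boldsymbol{\theta}) &= -\frac{\theta_1^2}{4\theta_2} - \frac{1}{2}\log(-2\theta_2) + \frac{1}{2}\log(2\pi) \tag{9.26}
\end{align}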


9.2.2.4 Non-examples


Not all distributions of interest belong to the exponential family. For example, the uniform 
distribution, X ~ Unif(a,b), does not, since the support of the distribution depends on the 
parameters. Also, the Student T distribution (Section 11.4.5) does not belong, since it does not 
have the required form. 


Log partition function 


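An important property of the exponential family is that derivatives of the log partition function
can be used to generate cumulants of the sufficient statistics.[2] For this reason, $A(\boldsymbol{\theta})$ is
sometimes called a cumulant function. We will prove this for a 1-parameter distribution;
this can be generalized to a K-parameter distribution in a straightforward way. For the first
derivative we have

\begin{align}
\frac{dA}{d\theta} &= \frac{d}{d\theta}\left(\log \int \exp(\theta\phi(x)) h(x) \, dx\right) \tag{9.27} \\
&= \frac{\frac{d}{d\theta} \int \exp(\theta\phi(x)) h(x) \, dx}{\int \exp(\theta\phi(x)) h(x) \, dx} \tag{9.28} \\
&= \frac{\int \phi(x) \exp(\theta\phi(x)) h(x) \, dx}{\exp(A(\theta))} \tag{9.29} \\
&= \int \phi(x) \exp(\theta\phi(x) - A(\theta)) h(x) \, dx \tag{9.30} \\
&= \int \phi(x) p(x) \, dx = \mathbb{E}[\phi(x)] \tag{9.31}
\end{align}

For the second derivative we have

\begin{align}
\frac{d^2A}{d\theta^2} &= \int \phi(x) \exp(\theta\phi(x) - A(\theta)) h(x) (\phi(x) - A'(\theta)) \, dx \tag{9.32} \\
&= \int \phi(x) p(x) (\phi(x) - A'(\theta)) \, dx \tag{9.33} \\
&= \int \phi^2(x) p(x) \, dx - A'(\theta) \int \phi(x) p(x) \, dx \tag{9.34} \\
&= \mathbb{E}[\phi^2(X)] - \mathbb{E}[\phi(x)]^2 = \mathrm{var}[\phi(x)] \tag{9.35}
\end{align}

where we used the fact that $A'(\theta) = \frac{dA}{d\theta} = \mathbb{E}[\phi(x)]$.

In the multivariate case, we have that

$$\frac{\partial^2 A}{\partial\theta_i \partial\theta_j} = \mathbb{E}[\phi_i(\mathbf{x})\phi_j(\mathbf{x})] - \mathbb{E}[\phi_i(\mathbf{x})]\,\mathbb{E}[\phi_j(\mathbf{x})] \tag{9.36}$$

and hence

$$\nabla^2 A(\boldsymbol{\theta}) = \mathrm{cov}[\boldsymbol{\phi}(\mathbf{x})] \tag{9.37}$$

Since the covariance is positive definite, we see that $A(\boldsymbol{\theta})$ is a convex function (see Section 7.3.3).


2. The first and second cumulants of a distribution are its mean $\mathbb{E}[X]$ and variance $\mathrm{var}[X]$, whereas the first and
second moments are its mean $\mathbb{E}[X]$ and $\mathbb{E}[X^2]$.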


Example: the Bernoulli distribution 


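For example, consider the Bernoulli distribution. We have $A(\theta) = \log(1 + e^\theta)$, so the mean is
given by

$$\frac{dA}{d\theta} = \frac{e^\theta}{1+e^\theta} = \frac{1}{1+e^{-\theta}} = \mathrm{sigm}(\theta) = \mu \tag{9.38}$$

The variance is given by

\begin{align}
\frac{d^2A}{d\theta^2} &= \frac{d}{d\theta}(1+e^{-\theta})^{-1} = (1+e^{-\theta})^{-2}\, e^{-\theta} \tag{9.39} \\
&= \frac{e^{-\theta}}{1+e^{-\theta}} \cdot \frac{1}{1+e^{-\theta}} = \frac{1}{e^\theta+1} \cdot \frac{1}{1+e^{-\theta}} = (1-\mu)\mu \tag{9.40}
\end{align}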
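
The cumulant property can be checked numerically by differentiating $A(\theta)$ with finite differences. The sketch below does this for the Bernoulli case; the step size `h` and the test point `theta = 0.8` are arbitrary choices made for illustration.

```python
import math

def A(theta: float) -> float:
    """Log partition function of the Bernoulli: A(theta) = log(1 + e^theta)."""
    return math.log(1.0 + math.exp(theta))

def first_derivative(f, t, h=1e-4):
    """Central finite-difference estimate of f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)

def second_derivative(f, t, h=1e-4):
    """Central finite-difference estimate of f''(t)."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)

theta = 0.8                                  # illustrative natural parameter
mu = 1.0 / (1.0 + math.exp(-theta))          # sigm(theta)
mean_numeric = first_derivative(A, theta)    # should approximate mu
var_numeric = second_derivative(A, theta)    # should approximate mu * (1 - mu)
```

Up to discretization error, the two numerical derivatives agree with Equations 9.38 and 9.40.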
MLE for the exponential family


The likelihood of an exponential family model has the form

$$p(\mathcal{D}|\boldsymbol{\theta}) = \left[\prod_{i=1}^N h(\mathbf{x}_i)\right] g(\boldsymbol{\theta})^N \exp\left(\boldsymbol{\eta}(\boldsymbol{\theta})^T \left[\sum_{i=1}^N \boldsymbol{\phi}(\mathbf{x}_i)\right]\right) \tag{9.41}$$

where $g(\boldsymbol{\theta}) = 1/Z(\boldsymbol{\theta})$. We see that the sufficient statistics are N and

$$\boldsymbol{\phi}(\mathcal{D}) = \left[\sum_{i=1}^N \phi_1(\mathbf{x}_i), \ldots, \sum_{i=1}^N \phi_K(\mathbf{x}_i)\right] \tag{9.42}$$

For example, for the Bernoulli model we have $\boldsymbol{\phi} = [\sum_i I(x_i = 1)]$, and for the univariate
Gaussian, we have $\boldsymbol{\phi} = [\sum_i x_i, \sum_i x_i^2]$. (We also need to know the sample size, N.)

The Pitman-Koopman-Darmois theorem states that, under certain regularity conditions, the
exponential family is the only family of distributions with finite sufficient statistics. (Here, finite
means of a size independent of the size of the data set.)

One of the conditions required in this theorem is that the support of the distribution not be
dependent on the parameter. For a simple example of a distribution that violates this condition,
consider the uniform distribution

$$p(x|\theta) = U(x|\theta) = \frac{1}{\theta} I(0 \le x \le \theta) \tag{9.43}$$

The likelihood is given by

$$p(\mathcal{D}|\theta) = \theta^{-N} I(0 \le \max\{x_i\} \le \theta) \tag{9.44}$$

So the sufficient statistics are N and $s(\mathcal{D}) = \max_i x_i$. This is finite in size, but the uniform
distribution is not in the exponential family because its support set, $\mathcal{X}$, depends on the
parameters.

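We now describe how to compute the MLE for a canonical exponential family model. Given
N iid data points $\mathcal{D} = (x_1, \ldots, x_N)$, the log-likelihood is

$$\log p(\mathcal{D}|\boldsymbol{\theta}) = \boldsymbol{\theta}^T \boldsymbol{\phi}(\mathcal{D}) - N A(\boldsymbol{\theta}) \tag{9.45}$$

Since $-A(\boldsymbol{\theta})$ is concave in $\boldsymbol{\theta}$, and $\boldsymbol{\theta}^T \boldsymbol{\phi}(\mathcal{D})$ is linear in $\boldsymbol{\theta}$, we see that the log likelihood is
concave, and hence has a unique global maximum. To derive this maximum, we use the fact
that the derivative of the log partition function yields the expected value of the sufficient statistic
vector (Section 9.2.3):

$$\nabla_{\boldsymbol{\theta}} \log p(\mathcal{D}|\boldsymbol{\theta}) = \boldsymbol{\phi}(\mathcal{D}) - N \mathbb{E}[\boldsymbol{\phi}(\mathbf{X})] \tag{9.46}$$

Setting this gradient to zero, we see that at the MLE, the empirical average of the sufficient
statistics must equal the model's theoretical expected sufficient statistics, i.e., $\hat{\boldsymbol{\theta}}$ must satisfy

$$\mathbb{E}[\boldsymbol{\phi}(\mathbf{X})] = \frac{1}{N} \sum_{i=1}^N \boldsymbol{\phi}(\mathbf{x}_i) \tag{9.47}$$

This is called moment matching. For example, in the Bernoulli distribution, we have $\phi(X) =
I(X = 1)$, so the MLE satisfies

$$\mathbb{E}[\phi(X)] = p(X = 1) = \hat{\mu} = \frac{1}{N} \sum_{i=1}^N I(x_i = 1) \tag{9.48}$$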
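
Moment matching can be sketched for any 1-parameter canonical model by solving $A'(\theta) = \bar{\phi}$ numerically. The example below uses bisection, which works because $A'$ is nondecreasing when $A$ is convex. The helper name `solve_moment_matching`, the search interval, and the data are assumptions made for illustration; for the Bernoulli the answer is also available in closed form as the log-odds of the sample mean.

```python
import math

def solve_moment_matching(mean_fn, target, lo=-30.0, hi=30.0, iters=200):
    """Find theta with mean_fn(theta) == target by bisection.

    mean_fn is A'(theta) = E[phi(X)], assumed nondecreasing in theta.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_fn(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bernoulli_mean(theta: float) -> float:
    """A'(theta) for the Bernoulli, i.e. sigm(theta)."""
    return 1.0 / (1.0 + math.exp(-theta))

data = [1, 0, 1, 1, 0, 1, 1, 0]        # hypothetical coin flips
phi_bar = sum(data) / len(data)        # empirical mean of phi(x) = I(x = 1)
theta_hat = solve_moment_matching(bernoulli_mean, phi_bar)
mu_hat = bernoulli_mean(theta_hat)
```

The recovered `mu_hat` equals the fraction of ones in the data, as Equation 9.48 requires.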


Bayes for the exponential family * 


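We have seen that exact Bayesian analysis is considerably simplified if the prior is conjugate to
the likelihood. Informally this means that the prior $p(\boldsymbol{\theta}|\boldsymbol{\tau})$ has the same form as the likelihood
$p(\mathcal{D}|\boldsymbol{\theta})$. For this to make sense, we require that the likelihood have finite sufficient statistics, so
that we can write $p(\mathcal{D}|\boldsymbol{\theta}) = p(\mathbf{s}(\mathcal{D})|\boldsymbol{\theta})$. This suggests that the only family of distributions for
which conjugate priors exist is the exponential family. We will derive the form of the prior and
posterior below.

Likelihood

The likelihood of the exponential family is given by

$$p(\mathcal{D}|\boldsymbol{\theta}) \propto g(\boldsymbol{\theta})^N \exp\left(\boldsymbol{\eta}(\boldsymbol{\theta})^T \mathbf{s}_N\right) \tag{9.49}$$

where $\mathbf{s}_N = \sum_{i=1}^N \mathbf{s}(\mathbf{x}_i)$. In terms of the canonical parameters this becomes

$$p(\mathcal{D}|\boldsymbol{\eta}) \propto \exp(N \boldsymbol{\eta}^T \bar{\mathbf{s}} - N A(\boldsymbol{\eta})) \tag{9.50}$$

where $\bar{\mathbf{s}} = \frac{1}{N}\mathbf{s}_N$.

Prior

The natural conjugate prior has the form

$$p(\boldsymbol{\theta}|\nu_0, \boldsymbol{\tau}_0) \propto g(\boldsymbol{\theta})^{\nu_0} \exp\left(\boldsymbol{\eta}(\boldsymbol{\theta})^T \boldsymbol{\tau}_0\right) \tag{9.51}$$

Let us write $\boldsymbol{\tau}_0 = \nu_0 \bar{\boldsymbol{\tau}}_0$, to separate out the size of the prior pseudo-data, $\nu_0$, from the mean of
the sufficient statistics on this pseudo-data, $\bar{\boldsymbol{\tau}}_0$. In canonical form, the prior becomes

$$p(\boldsymbol{\eta}|\nu_0, \bar{\boldsymbol{\tau}}_0) \propto \exp(\nu_0 \boldsymbol{\eta}^T \bar{\boldsymbol{\tau}}_0 - \nu_0 A(\boldsymbol{\eta})) \tag{9.52}$$

Posterior

The posterior is given by

$$p(\boldsymbol{\theta}|\mathcal{D}) = p(\boldsymbol{\theta}|\nu_N, \boldsymbol{\tau}_N) = p(\boldsymbol{\theta}|\nu_0 + N, \boldsymbol{\tau}_0 + \mathbf{s}_N) \tag{9.53}$$

So we see that we just update the hyper-parameters by adding. In canonical form, this becomes

\begin{align}
p(\boldsymbol{\eta}|\mathcal{D}) &\propto \exp\left(\boldsymbol{\eta}^T(\nu_0 \bar{\boldsymbol{\tau}}_0 + N \bar{\mathbf{s}}) - (\nu_0 + N) A(\boldsymbol{\eta})\right) \tag{9.54} \\
&= p\left(\boldsymbol{\eta} \,\Big|\, \nu_0 + N, \frac{\nu_0 \bar{\boldsymbol{\tau}}_0 + N \bar{\mathbf{s}}}{\nu_0 + N}\right) \tag{9.55}
\end{align}

So we see that the posterior hyper-parameters are a convex combination of the prior mean
hyper-parameters and the average of the sufficient statistics.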


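9.2.5.4 Posterior predictive density


Let us derive a generic expression for the predictive density for future observables $\mathcal{D}' =
(\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_{N'})$ given past data $\mathcal{D} = (\mathbf{x}_1, \ldots, \mathbf{x}_N)$ as follows. For notational brevity, we
will combine the sufficient statistics with the size of the data, as follows: $\tilde{\boldsymbol{\tau}}_0 = (\nu_0, \boldsymbol{\tau}_0)$,
$\tilde{\mathbf{s}}(\mathcal{D}) = (N, \mathbf{s}(\mathcal{D}))$, and $\tilde{\mathbf{s}}(\mathcal{D}') = (N', \mathbf{s}(\mathcal{D}'))$. So the prior becomes

$$p(\boldsymbol{\theta}|\tilde{\boldsymbol{\tau}}_0) = \frac{1}{Z(\tilde{\boldsymbol{\tau}}_0)} g(\boldsymbol{\theta})^{\nu_0} \exp(\boldsymbol{\eta}(\boldsymbol{\theta})^T \boldsymbol{\tau}_0) \tag{9.56}$$

The likelihood and posterior have a similar form. Hence

\begin{align}
p(\mathcal{D}'|\mathcal{D}) &= \int p(\mathcal{D}'|\boldsymbol{\theta}) p(\boldsymbol{\theta}|\mathcal{D}) \, d\boldsymbol{\theta} \tag{9.57} \\
&= \left[\prod_{i=1}^{N'} h(\tilde{\mathbf{x}}_i)\right] Z(\tilde{\boldsymbol{\tau}}_0 + \tilde{\mathbf{s}}(\mathcal{D}))^{-1} \int g(\boldsymbol{\theta})^{\nu_0 + N + N'} \tag{9.58} \\
&\quad \times \exp\left(\sum_k \eta_k(\boldsymbol{\theta}) \left(\tau_k + \sum_{i=1}^N s_k(\mathbf{x}_i) + \sum_{i=1}^{N'} s_k(\tilde{\mathbf{x}}_i)\right)\right) d\boldsymbol{\theta} \tag{9.59} \\
&= \left[\prod_{i=1}^{N'} h(\tilde{\mathbf{x}}_i)\right] \frac{Z(\tilde{\boldsymbol{\tau}}_0 + \tilde{\mathbf{s}}(\mathcal{D}) + \tilde{\mathbf{s}}(\mathcal{D}'))}{Z(\tilde{\boldsymbol{\tau}}_0 + \tilde{\mathbf{s}}(\mathcal{D}))} \tag{9.60}
\end{align}

If N = 0, this becomes the marginal likelihood of $\mathcal{D}'$, which reduces to the familiar form of
normalizer of the posterior divided by the normalizer of the prior, multiplied by a constant.
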
Example: Bernoulli distribution 


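As a simple example, let us revisit the Beta-Bernoulli model in our new notation.
The likelihood is given by

$$p(\mathcal{D}|\theta) = (1-\theta)^N \exp\left(\log\left(\frac{\theta}{1-\theta}\right) \sum_i x_i\right) \tag{9.61}$$

Hence the conjugate prior is given by

\begin{align}
p(\theta|\nu_0, \tau_0) &\propto (1-\theta)^{\nu_0} \exp\left(\log\left(\frac{\theta}{1-\theta}\right) \tau_0\right) \tag{9.62} \\
&= \theta^{\tau_0} (1-\theta)^{\nu_0 - \tau_0} \tag{9.63}
\end{align}

If we define $\alpha = \tau_0 + 1$ and $\beta = \nu_0 - \tau_0 + 1$, we see that this is a beta distribution.
We can derive the posterior as follows, where $s = \sum_i I(x_i = 1)$ is the sufficient statistic:

\begin{align}
p(\theta|\mathcal{D}) &\propto \theta^{\tau_0 + s} (1-\theta)^{\nu_0 - \tau_0 + n - s} \tag{9.64} \\
&= \theta^{\tau_n} (1-\theta)^{\nu_n - \tau_n} \tag{9.65}
\end{align}

We can derive the posterior predictive distribution as follows. Assume $p(\theta) = \mathrm{Beta}(\theta|\alpha, \beta)$,
and let $s = s(\mathcal{D})$ be the number of heads in the past data. We can predict the probability of a
given sequence of future heads, $\mathcal{D}' = (\tilde{x}_1, \ldots, \tilde{x}_m)$, with sufficient statistic $s' = \sum_{i=1}^m I(\tilde{x}_i =
1)$, as follows:

\begin{align}
p(\mathcal{D}'|\mathcal{D}) &= \int_0^1 p(\mathcal{D}'|\theta) \, \mathrm{Beta}(\theta|\alpha_n, \beta_n) \, d\theta \tag{9.66} \\
&= \frac{\Gamma(\alpha_n + \beta_n)}{\Gamma(\alpha_n)\Gamma(\beta_n)} \int_0^1 \theta^{\alpha_n + s' - 1} (1-\theta)^{\beta_n + m - s' - 1} \, d\theta \tag{9.67} \\
&= \frac{\Gamma(\alpha_n + \beta_n)}{\Gamma(\alpha_n)\Gamma(\beta_n)} \, \frac{\Gamma(\alpha_{n+m})\Gamma(\beta_{n+m})}{\Gamma(\alpha_{n+m} + \beta_{n+m})} \tag{9.68}
\end{align}

where

\begin{align}
\alpha_{n+m} &= \alpha_n + s' = \alpha + s + s' \tag{9.69} \\
\beta_{n+m} &= \beta_n + (m - s') = \beta + (n - s) + (m - s') \tag{9.70}
\end{align}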
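
Equations 9.66 to 9.70 can be evaluated stably in log space using the log-gamma function. The following sketch assumes a hypothetical helper name `posterior_predictive` and made-up coin-flip data; it computes the probability of one specific future sequence, not of a count.

```python
import math

def log_beta_fn(a: float, b: float) -> float:
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def posterior_predictive(alpha, beta, past, future):
    """p(D'|D) for a specific future sequence (Equations 9.66-9.70)."""
    s, n = sum(past), len(past)
    s_new, m = sum(future), len(future)
    alpha_n, beta_n = alpha + s, beta + (n - s)                  # posterior
    alpha_nm, beta_nm = alpha_n + s_new, beta_n + (m - s_new)    # Eqs 9.69-9.70
    return math.exp(log_beta_fn(alpha_nm, beta_nm) - log_beta_fn(alpha_n, beta_n))

# Hypothetical data: uniform Beta(1, 1) prior, three heads and one tail observed.
past = [1, 1, 1, 0]
p_next_head = posterior_predictive(1.0, 1.0, past, [1])
p_head_then_tail = posterior_predictive(1.0, 1.0, past, [1, 0])
```

With these inputs the posterior is Beta(4, 2), so the probability of a head on the next toss is 4/6, and the probability of a head followed by a tail is (4/6)(2/7).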


Maximum entropy derivation of the exponential family * 


Although the exponential family is convenient, is there any deeper justification for its use? It 
turns out that there is: it is the distribution that makes the least number of assumptions about 
the data, subject to a specific set of user-specified constraints, as we explain below. In particular, 
suppose all we know is the expected values of certain features or functions: 


$$\sum_{\mathbf{x}} f_k(\mathbf{x}) p(\mathbf{x}) = F_k \tag{9.71}$$

where $F_k$ are known constants, and $f_k(\mathbf{x})$ is an arbitrary function. The principle of maximum
entropy or maxent says we should pick the distribution with maximum entropy (closest to
uniform), subject to the constraints that the moments of the distribution match the empirical
moments of the specified functions.

To maximize entropy subject to the constraints in Equation 9.71, and the constraints that
$p(\mathbf{x}) \ge 0$ and $\sum_{\mathbf{x}} p(\mathbf{x}) = 1$, we need to use Lagrange multipliers. The Lagrangian is given by

$$J(p, \boldsymbol{\lambda}) = -\sum_{\mathbf{x}} p(\mathbf{x}) \log p(\mathbf{x}) + \lambda_0 \left(1 - \sum_{\mathbf{x}} p(\mathbf{x})\right) + \sum_k \lambda_k \left(F_k - \sum_{\mathbf{x}} p(\mathbf{x}) f_k(\mathbf{x})\right) \tag{9.72}$$
x k x 


We can use the calculus of variations to take derivatives wrt the function p, but we will adopt 
a simpler approach and treat p as a fixed length vector (since we are assuming x is discrete). 
Then we have 


$$\frac{\partial J}{\partial p(\mathbf{x})} = -1 - \log p(\mathbf{x}) - \lambda_0 - \sum_k \lambda_k f_k(\mathbf{x}) \tag{9.73}$$

Setting $\frac{\partial J}{\partial p(\mathbf{x})} = 0$ yields

$$p(\mathbf{x}) = \frac{1}{Z} \exp\left(-\sum_k \lambda_k f_k(\mathbf{x})\right) \tag{9.74}$$

where $Z = e^{1+\lambda_0}$. Using the sum to one constraint, we have

$$1 = \sum_{\mathbf{x}} p(\mathbf{x}) = \frac{1}{Z} \sum_{\mathbf{x}} \exp\left(-\sum_k \lambda_k f_k(\mathbf{x})\right) \tag{9.75}$$

Hence the normalization constant is given by

$$Z = \sum_{\mathbf{x}} \exp\left(-\sum_k \lambda_k f_k(\mathbf{x})\right) \tag{9.76}$$

Thus the maxent distribution $p(\mathbf{x})$ has the form of the exponential family (Section 9.2), also
known as the Gibbs distribution.


Figure 9.1 A visualization of the various features of a GLM. Based on Figure 8.3 of (Jordan 2007).
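
To make the maximum entropy derivation concrete, the following sketch finds the maxent distribution over the faces of a die subject to a single mean constraint. It is an illustration under stated assumptions: there is one feature $f(x) = x$, the multiplier is found by bisection (the constrained mean decreases as the multiplier grows), and the function name and target values are hypothetical.

```python
import math

def maxent_distribution(values, feature, target, lo=-50.0, hi=50.0, iters=200):
    """Maximum entropy pmf on `values` subject to E[feature(x)] = target.

    Uses p(x) proportional to exp(-lam * feature(x)) (Equation 9.74) and
    finds the single multiplier lam by bisection.
    """
    def pmf(lam):
        exps = [-lam * feature(x) for x in values]
        top = max(exps)                      # subtract max for stability
        weights = [math.exp(e - top) for e in exps]
        total = sum(weights)                 # plays the role of Z (Eq 9.76)
        return [w / total for w in weights]

    def mean(lam):
        return sum(p * feature(x) for p, x in zip(pmf(lam), values))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(mid) > target:
            lo = mid                         # mean too high: increase lam
        else:
            hi = mid
    return pmf(0.5 * (lo + hi))

faces = [1, 2, 3, 4, 5, 6]
p_fair = maxent_distribution(faces, lambda x: x, 3.5)    # constraint already met by uniform
p_tilted = maxent_distribution(faces, lambda x: x, 4.5)  # hypothetical constraint
```

When the target mean is 3.5 the multiplier is zero and the result is uniform; a larger target tilts the probabilities exponentially toward the higher faces.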


9.3 Generalized linear models (GLMs)


Linear and logistic regression are examples of generalized linear models, or GLMs (McCullagh 
and Nelder 1989). These are models in which the output density is in the exponential family 
(Section 9.2), and in which the mean parameters are a linear combination of the inputs, passed 
through a possibly nonlinear function, such as the logistic function. We describe GLMs in more 
detail below. We focus on scalar outputs for notational simplicity. (This excludes multinomial 
logistic regression, but this is just to simplify the presentation.) 


9.3.1 Basics


To understand GLMs, let us first consider the case of an unconditional distribution for a scalar
response variable:

$$p(y_i|\theta, \sigma^2) = \exp\left[\frac{y_i \theta - A(\theta)}{\sigma^2} + c(y_i, \sigma^2)\right] \tag{9.77}$$

where $\sigma^2$ is the dispersion parameter (often set to 1), $\theta$ is the natural parameter, $A$ is the
log partition function, and $c$ is a normalization constant. For example, in the case of logistic
regression, $\theta$ is the log-odds ratio, $\theta = \log\left(\frac{\mu}{1-\mu}\right)$, where $\mu = \mathbb{E}[y] = p(y = 1)$ is the mean
parameter (see Section 9.2.2.1). To convert from the mean parameter to the natural parameter
we can use a function $\psi$, so $\theta = \psi(\mu)$. This function is uniquely determined by the form of the
exponential family distribution. In fact, this is an invertible mapping, so we have $\mu = \psi^{-1}(\theta)$.
Furthermore, we know from Section 9.2.3 that the mean is given by the derivative of the log partition
function, so we have $\mu = \psi^{-1}(\theta) = A'(\theta)$.

Distrib.                          Link $g(\mu)$   $\theta = \psi(\mu)$                      $\mu = \psi^{-1}(\theta) = \mathbb{E}[y]$
$\mathcal{N}(\mu, \sigma^2)$      identity        $\theta = \mu$                            $\mu = \theta$
$\mathrm{Bin}(N, \mu)$            logit           $\theta = \log(\frac{\mu}{1-\mu})$        $\mu = \mathrm{sigm}(\theta)$
$\mathrm{Poi}(\mu)$               log             $\theta = \log(\mu)$                      $\mu = e^\theta$

Table 9.1 Canonical link functions $\psi$ and their inverses for some common GLMs.

Now let us add inputs/covariates. We first define a linear function of the inputs:

$$\eta_i = \mathbf{w}^T \mathbf{x}_i \tag{9.78}$$

We now make the mean of the distribution be some invertible monotonic function of this linear
combination. By convention, this function, known as the mean function, is denoted by $g^{-1}$, so

$$\mu_i = g^{-1}(\eta_i) = g^{-1}(\mathbf{w}^T \mathbf{x}_i) \tag{9.79}$$

See Figure 9.1 for a summary of the basic model.

The inverse of the mean function, namely $g()$, is called the link function. We are free to
choose almost any function we like for $g$, so long as it is invertible, and so long as $g^{-1}$ has the
appropriate range. For example, in logistic regression, we set $\mu_i = g^{-1}(\eta_i) = \mathrm{sigm}(\eta_i)$.

One particularly simple form of link function is to use $g = \psi$; this is called the canonical
link function. In this case, $\theta_i = \eta_i = \mathbf{w}^T \mathbf{x}_i$, so the model becomes

$$p(y_i|\mathbf{x}_i, \mathbf{w}, \sigma^2) = \exp\left[\frac{y_i \mathbf{w}^T \mathbf{x}_i - A(\mathbf{w}^T \mathbf{x}_i)}{\sigma^2} + c(y_i, \sigma^2)\right] \tag{9.80}$$

In Table 9.1, we list some distributions and their canonical link functions. We see that for the
Bernoulli/binomial distribution, the canonical link is the logit function, $g(\mu) = \log(\mu/(1-\mu))$,
whose inverse is the logistic function, $\mu = \mathrm{sigm}(\eta)$.

Based on the results in Section 9.2.3, we can show that the mean and variance of the response
variable are as follows:

\begin{align}
\mathbb{E}[y|\mathbf{x}_i, \mathbf{w}, \sigma^2] &= \mu_i = A'(\theta_i) \tag{9.81} \\
\mathrm{var}[y|\mathbf{x}_i, \mathbf{w}, \sigma^2] &= \sigma_i^2 = A''(\theta_i)\sigma^2 \tag{9.82}
\end{align}


To make the notation clearer, let us consider some simple examples. 


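• For linear regression, we have

$$\log p(y_i|\mathbf{x}_i, \mathbf{w}, \sigma^2) = \frac{y_i \mu_i - \frac{\mu_i^2}{2}}{\sigma^2} - \frac{1}{2}\left(\frac{y_i^2}{\sigma^2} + \log(2\pi\sigma^2)\right) \tag{9.83}$$

where $y_i \in \mathbb{R}$, and $\theta_i = \mu_i = \mathbf{w}^T \mathbf{x}_i$. Here $A(\theta) = \theta^2/2$, so $\mathbb{E}[y_i] = \mu_i$ and $\mathrm{var}[y_i] = \sigma^2$.

• For binomial regression, we have

$$\log p(y_i|\mathbf{x}_i, \mathbf{w}) = y_i \log\left(\frac{\pi_i}{1-\pi_i}\right) + N_i \log(1-\pi_i) + \log\binom{N_i}{y_i} \tag{9.84}$$

where $y_i \in \{0, 1, \ldots, N_i\}$, $\pi_i = \mathrm{sigm}(\mathbf{w}^T \mathbf{x}_i)$, $\theta_i = \log(\pi_i/(1-\pi_i)) = \mathbf{w}^T \mathbf{x}_i$, and $\sigma^2 = 1$.
Here $A(\theta) = N_i \log(1 + e^\theta)$, so $\mathbb{E}[y_i] = N_i \pi_i = \mu_i$, $\mathrm{var}[y_i] = N_i \pi_i (1-\pi_i)$.

• For Poisson regression, we have

$$\log p(y_i|\mathbf{x}_i, \mathbf{w}) = y_i \log \mu_i - \mu_i - \log(y_i!) \tag{9.85}$$

where $y_i \in \{0, 1, 2, \ldots\}$, $\mu_i = \exp(\mathbf{w}^T \mathbf{x}_i)$, $\theta_i = \log(\mu_i) = \mathbf{w}^T \mathbf{x}_i$, and $\sigma^2 = 1$. Here
$A(\theta) = e^\theta$, so $\mathbb{E}[y_i] = \mathrm{var}[y_i] = \mu_i$. Poisson regression is widely used in bio-statistical
applications, where $y_i$ might represent the number of diseases of a given person or place,
or the number of reads at a genomic location in a high-throughput sequencing context (see
e.g., (Kuan et al. 2009)).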


9.3.2 ML and MAP estimation 


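One of the appealing properties of GLMs is that they can be fit using exactly the same methods
that we used to fit logistic regression. In particular, dropping the terms $c(y_i, \sigma^2)$, which do not
depend on $\mathbf{w}$, the log-likelihood has the following form:

\begin{align}
\ell(\mathbf{w}) = \log p(\mathcal{D}|\mathbf{w}) &= \frac{1}{\sigma^2} \sum_{i=1}^N \ell_i \tag{9.86} \\
\ell_i &\triangleq \theta_i y_i - A(\theta_i) \tag{9.87}
\end{align}

We can compute the gradient vector using the chain rule as follows:

\begin{align}
\frac{d\ell_i}{dw_j} &= \frac{d\ell_i}{d\theta_i} \frac{d\theta_i}{d\mu_i} \frac{d\mu_i}{d\eta_i} \frac{d\eta_i}{dw_j} \tag{9.88} \\
&= (y_i - A'(\theta_i)) \frac{d\theta_i}{d\mu_i} \frac{d\mu_i}{d\eta_i} x_{ij} \tag{9.89} \\
&= (y_i - \mu_i) \frac{d\theta_i}{d\mu_i} \frac{d\mu_i}{d\eta_i} x_{ij} \tag{9.90}
\end{align}

If we use a canonical link, $\theta_i = \eta_i$, this simplifies to

$$\nabla_{\mathbf{w}} \ell(\mathbf{w}) = \frac{1}{\sigma^2} \left[\sum_{i=1}^N (y_i - \mu_i) \mathbf{x}_i\right] \tag{9.91}$$

which is a sum of the input vectors, weighted by the errors. This can be used inside a (stochastic)
gradient descent procedure, discussed in Section 8.5.2. However, for improved efficiency, we
should use a second-order method. If we use a canonical link, the Hessian is given by

$$\mathbf{H} = -\frac{1}{\sigma^2} \sum_{i=1}^N \frac{d\mu_i}{d\theta_i} \mathbf{x}_i \mathbf{x}_i^T = -\frac{1}{\sigma^2} \mathbf{X}^T \mathbf{S} \mathbf{X} \tag{9.92}$$

where $\mathbf{S} = \mathrm{diag}\left(\frac{d\mu_1}{d\theta_1}, \ldots, \frac{d\mu_N}{d\theta_N}\right)$ is a diagonal weighting matrix. This can be used inside the
IRLS algorithm (Section 8.3.4). Specifically, we have the following Newton update:

\begin{align}
\mathbf{w}_{t+1} &= (\mathbf{X}^T \mathbf{S}_t \mathbf{X})^{-1} \mathbf{X}^T \mathbf{S}_t \mathbf{z}_t \tag{9.93} \\
\mathbf{z}_t &= \boldsymbol{\theta}_t + \mathbf{S}_t^{-1}(\mathbf{y} - \boldsymbol{\mu}_t) \tag{9.94}
\end{align}

where $\boldsymbol{\theta}_t = \mathbf{X}\mathbf{w}_t$ and $\boldsymbol{\mu}_t = g^{-1}(\boldsymbol{\eta}_t)$.


Name                    Formula
Logistic                $g^{-1}(\eta) = \mathrm{sigm}(\eta) = \frac{e^\eta}{1+e^\eta}$
Probit                  $g^{-1}(\eta) = \Phi(\eta)$
Log-log                 $g^{-1}(\eta) = \exp(-\exp(-\eta))$
Complementary log-log   $g^{-1}(\eta) = 1 - \exp(-\exp(\eta))$

Table 9.2 Summary of some possible mean functions for binary regression.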

If we extend the derivation to handle non-canonical links, we find that the Hessian has another 
term. However, it turns out that the expected Hessian is the same as in Equation 9.92; using 
the expected Hessian (known as the Fisher information matrix) instead of the actual Hessian is 
known as the Fisher scoring method. 

It is straightforward to modify the above procedure to perform MAP estimation with a Gaussian
prior: we just modify the objective, gradient and Hessian, just as we added $\ell_2$ regularization
to logistic regression in Section 8.3.6.
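
The IRLS update can be sketched for Poisson regression with an intercept and a single covariate, in which case the normal equations reduce to a 2x2 linear system that can be solved by hand. This is a minimal illustration, not the book's implementation: the function name `irls_poisson`, the starting point, the fixed iteration count, and the count data are all assumptions.

```python
import math

def irls_poisson(xs, ys, iters=25):
    """Fit log E[y] = w0 + w1 * x by IRLS (Equations 9.93-9.94).

    For the Poisson with canonical (log) link, mu_i = exp(eta_i) and
    S = diag(d mu_i / d theta_i) = diag(mu_i).
    """
    w0, w1 = math.log(sum(ys) / len(ys)), 0.0   # start at the intercept-only fit
    for _ in range(iters):
        # Accumulate the 2x2 system (X^T S X) w = X^T S z.
        a00 = a01 = a11 = b0 = b1 = 0.0
        for x, y in zip(xs, ys):
            eta = w0 + w1 * x
            mu = math.exp(eta)
            z = eta + (y - mu) / mu             # working response z_t
            a00 += mu
            a01 += mu * x
            a11 += mu * x * x
            b0 += mu * z
            b1 += mu * x * z
        det = a00 * a11 - a01 * a01
        w0, w1 = (a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det
    return w0, w1

# Hypothetical count data, roughly growing with x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1, 3, 4, 9, 14]
w0, w1 = irls_poisson(xs, ys)
mus = [math.exp(w0 + w1 * x) for x in xs]
```

At convergence the gradient in Equation 9.91 vanishes, so the residuals $y_i - \mu_i$ sum to zero both unweighted and weighted by $x_i$. A practical implementation would add a convergence test and step-size safeguards rather than a fixed iteration count.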


Bayesian inference 


Bayesian inference for GLMs is usually conducted using MCMC (Chapter 24). Possible methods 
include Metropolis Hastings with an IRLS-based proposal (Gamerman 1997), Gibbs sampling 
using adaptive rejection sampling (ARS) for each full-conditional (Dellaportas and Smith 1993), 
etc. See e.g., (Dey et al. 2000) for further information. It is also possible to use the Gaussian
approximation (Section 8.4.1) or variational inference (Section 21.8.1.1). 


Probit regression 


In (binary) logistic regression, we use a model of the form $p(y = 1|\mathbf{x}_i, \mathbf{w}) = \mathrm{sigm}(\mathbf{w}^T \mathbf{x}_i)$. In
general, we can write $p(y = 1|\mathbf{x}_i, \mathbf{w}) = g^{-1}(\mathbf{w}^T \mathbf{x}_i)$, for any function $g^{-1}$ that maps $[-\infty, \infty]$
to $[0, 1]$. Several possible mean functions are listed in Table 9.2.

In this section, we focus on the case where $g^{-1}(\eta) = \Phi(\eta)$, where $\Phi(\eta)$ is the cdf of the
standard normal. This is known as probit regression. The probit function is very similar to
the logistic function, as shown in Figure 8.7(b). However, this model has some advantages over
logistic regression, as we will see.
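ML/MAP estimation using gradient-based optimization


We can find the MLE for probit regression using standard gradient methods. Let $\mu_i = \mathbf{w}^T \mathbf{x}_i$,
and let $\tilde{y}_i \in \{-1, +1\}$. Then the gradient of the log-likelihood for a specific case is given by

$$\mathbf{g}_i \triangleq \frac{d}{d\mathbf{w}} \log p(\tilde{y}_i|\mathbf{w}^T \mathbf{x}_i) = \frac{d\mu_i}{d\mathbf{w}} \frac{d}{d\mu_i} \log p(\tilde{y}_i|\mathbf{w}^T \mathbf{x}_i) = \mathbf{x}_i \frac{\tilde{y}_i \phi(\mu_i)}{\Phi(\tilde{y}_i \mu_i)} \tag{9.95}$$

where $\phi$ is the standard normal pdf, and $\Phi$ is its cdf. Similarly, the Hessian for a single case is
given by

$$\mathbf{H}_i = \frac{d^2}{d\mathbf{w}^2} \log p(\tilde{y}_i|\mathbf{w}^T \mathbf{x}_i) = -\mathbf{x}_i \left(\frac{\phi(\mu_i)^2}{\Phi(\tilde{y}_i \mu_i)^2} + \frac{\tilde{y}_i \mu_i \phi(\mu_i)}{\Phi(\tilde{y}_i \mu_i)}\right) \mathbf{x}_i^T \tag{9.96}$$

We can modify these expressions to compute the MAP estimate in a straightforward manner. In
particular, if we use the prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \mathbf{V}_0)$, the log prior is $-\frac{1}{2}\mathbf{w}^T \mathbf{V}_0^{-1} \mathbf{w}$ plus a constant,
so the gradient and Hessian of the penalized log likelihood have the form $\sum_i \mathbf{g}_i - \mathbf{V}_0^{-1}\mathbf{w}$
and $\sum_i \mathbf{H}_i - \mathbf{V}_0^{-1}$. These expressions can be
passed to any gradient-based optimizer. See probitRegDemo for a demo.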
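
The per-case gradient in Equation 9.95 can be checked against finite differences of the log-likelihood. The sketch below builds the normal pdf and cdf from the standard library; the function names and the particular values of `w`, `x`, and `y_tilde` are arbitrary illustrations.

```python
import math

def norm_pdf(u: float) -> float:
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def norm_cdf(u: float) -> float:
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def log_lik(w, x, y_tilde):
    """log p(y_tilde | w^T x) = log Phi(y_tilde * w^T x), y_tilde in {-1, +1}."""
    mu = sum(wj * xj for wj, xj in zip(w, x))
    return math.log(norm_cdf(y_tilde * mu))

def grad(w, x, y_tilde):
    """Per-case gradient of Equation 9.95."""
    mu = sum(wj * xj for wj, xj in zip(w, x))
    scale = y_tilde * norm_pdf(mu) / norm_cdf(y_tilde * mu)
    return [scale * xj for xj in x]

w, x, y_tilde = [0.5, -1.0], [1.0, 2.0], -1   # hypothetical values
g = grad(w, x, y_tilde)

# Central finite differences as an independent check.
h = 1e-6
g_fd = []
for j in range(len(w)):
    w_plus = [wk + h * (k == j) for k, wk in enumerate(w)]
    w_minus = [wk - h * (k == j) for k, wk in enumerate(w)]
    g_fd.append((log_lik(w_plus, x, y_tilde) - log_lik(w_minus, x, y_tilde)) / (2 * h))
```

The analytic and numerical gradients should agree to several decimal places. For very negative arguments, `norm_cdf` underflows, so a production implementation would compute the ratio of pdf to cdf more carefully.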


Latent variable interpretation 


We can interpret the probit (and logistic) model as follows. First, let us associate each item
$\mathbf{x}_i$ with two latent utilities, $u_{0i}$ and $u_{1i}$, corresponding to the possible choices of $y_i = 0$ and
$y_i = 1$. We then assume that the observed choice is whichever action has larger utility. More
precisely, the model is as follows:

\begin{align}
u_{0i} &\triangleq \mathbf{w}_0^T \mathbf{x}_i + \delta_{0i} \tag{9.97} \\
u_{1i} &\triangleq \mathbf{w}_1^T \mathbf{x}_i + \delta_{1i} \tag{9.98} \\
y_i &= I(u_{1i} > u_{0i}) \tag{9.99}
\end{align}

where $\delta$'s are error terms, representing all the other factors that might be relevant in decision
making that we have chosen not to (or are unable to) model. This is called a random utility
model or RUM (McFadden 1974; Train 2009).

Since it is only the difference in utilities that matters, let us define $z_i = u_{1i} - u_{0i} = \mathbf{w}^T \mathbf{x}_i + \epsilon_i$, where
$\mathbf{w} = \mathbf{w}_1 - \mathbf{w}_0$ and $\epsilon_i = \delta_{1i} - \delta_{0i}$. If the $\delta$'s have a Gaussian distribution, then so does $\epsilon_i$. Thus we can write

\begin{align}
z_i &\triangleq \mathbf{w}^T \mathbf{x}_i + \epsilon_i \tag{9.100} \\
\epsilon_i &\sim \mathcal{N}(0, 1) \tag{9.101} \\
y_i &= I(z_i \ge 0) \tag{9.102}
\end{align}

Following (Fruhwirth-Schnatter and Fruhwirth 2010), we call this the difference RUM or dRUM
model.

When we marginalize out $z_i$, we recover the probit model:

\begin{align}
p(y_i = 1|\mathbf{x}_i, \mathbf{w}) &= \int I(z_i \ge 0) \, \mathcal{N}(z_i|\mathbf{w}^T \mathbf{x}_i, 1) \, dz_i \tag{9.103} \\
&= p(\mathbf{w}^T \mathbf{x}_i + \epsilon \ge 0) = p(\epsilon \ge -\mathbf{w}^T \mathbf{x}_i) \tag{9.104} \\
&= 1 - \Phi(-\mathbf{w}^T \mathbf{x}_i) = \Phi(\mathbf{w}^T \mathbf{x}_i) \tag{9.105}
\end{align}

where we used the symmetry of the Gaussian.[3] This latent variable interpretation provides an
alternative way to fit the model, as discussed in Section 11.4.6.

Interestingly, if we use a Gumbel distribution for the $\delta$'s, we induce a logistic distribution for
$\epsilon_i$, and the model reduces to logistic regression. See Section 24.5.1 for further details.
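
The marginalization in Equations 9.103 to 9.105 can be checked by simulation: draw the latent $z_i$ many times and count how often it is non-negative. In this sketch the function name `simulate_drum`, the value of `eta`, the sample size, and the seed are illustrative assumptions.

```python
import math
import random

def norm_cdf(u: float) -> float:
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def simulate_drum(eta: float, n_samples: int, seed: int = 0) -> float:
    """Fraction of draws with z = eta + eps >= 0, where eps ~ N(0, 1)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if eta + rng.gauss(0.0, 1.0) >= 0.0)
    return hits / n_samples

eta = 0.7                                   # hypothetical value of w^T x_i
p_monte_carlo = simulate_drum(eta, 200_000)
p_exact = norm_cdf(eta)                     # probit prediction, Equation 9.105
```

The Monte Carlo estimate should match $\Phi(\mathbf{w}^T \mathbf{x}_i)$ up to sampling noise, which is on the order of 0.001 for this sample size.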


Ordinal probit regression * 


One advantage of the latent variable interpretation of probit regression is that it is easy to extend 
to the case where the response variable is ordinal, that is, it can take on C discrete values which 
can be ordered in some way, such as low, medium and high. This is called ordinal regression. 
The basic idea is as follows. We introduce $C + 1$ thresholds $\gamma_j$ and set

$$y_i = j \quad \text{if} \quad \gamma_{j-1} < z_i \le \gamma_j \tag{9.106}$$

where $\gamma_0 \le \cdots \le \gamma_C$. For identifiability reasons, we set $\gamma_0 = -\infty$, $\gamma_1 = 0$ and $\gamma_C = \infty$. For
example, if $C = 2$, this reduces to the standard binary probit model, whereby $z_i < 0$ produces
$y_i = 0$ and $z_i \ge 0$ produces $y_i = 1$. If $C = 3$, we partition the real line into 3 intervals:
$(-\infty, 0]$, $(0, \gamma_2]$, $(\gamma_2, \infty)$. We can vary the parameter $\gamma_2$ to ensure the right relative amount
of probability mass falls in each interval, so as to match the empirical frequencies of each class
label.

Finding the MLEs for this model is a bit trickier than for binary probit regression, since
we need to optimize for $\mathbf{w}$ and $\boldsymbol{\gamma}$, and the latter must obey an ordering constraint. See e.g.,
(Kawakatsu and Largey 2009) for an approach based on EM. It is also possible to derive a simple
Gibbs sampling algorithm for this model (see e.g., (Hoff 2009, p216)).
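
Since $z_i \sim \mathcal{N}(\mathbf{w}^T \mathbf{x}_i, 1)$, Equation 9.106 implies that the class probabilities are differences of normal cdf values at consecutive thresholds. This formula is derived here rather than stated in the text, and the sketch below, with its hypothetical function name, threshold values, and classes labelled 1 to C, is only an illustration of it.

```python
import math

def norm_cdf(u: float) -> float:
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def ordinal_probs(eta, thresholds):
    """p(y = j) = Phi(gamma_j - eta) - Phi(gamma_{j-1} - eta) for j = 1..C.

    `thresholds` holds gamma_0 <= ... <= gamma_C, with gamma_0 = -inf and
    gamma_C = +inf, and eta = w^T x is the mean of the latent z.
    """
    cdf = [norm_cdf(g - eta) for g in thresholds]
    return [hi - lo for lo, hi in zip(cdf[:-1], cdf[1:])]

# C = 3 classes with gamma_1 = 0 fixed and a hypothetical gamma_2 = 1.2.
gammas = [-math.inf, 0.0, 1.2, math.inf]
probs = ordinal_probs(0.4, gammas)
```

The three probabilities sum to one, and moving `gamma_2` shifts mass between the middle and top classes without affecting the bottom one.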


Multinomial probit models * 


Now consider the case where the response variable can take on C unordered categorical values,
$y_i \in \{1, \ldots, C\}$. The multinomial probit model is defined as follows:

\begin{align}
z_{ic} &= \mathbf{w}^T \mathbf{x}_{ic} + \epsilon_{ic} \tag{9.107} \\
\boldsymbol{\epsilon} &\sim \mathcal{N}(\mathbf{0}, \mathbf{R}) \tag{9.108} \\
y_i &= \arg\max_c z_{ic} \tag{9.109}
\end{align}

See e.g., (Dow and Endersby 2004; Scott 2009; Fruhwirth-Schnatter and Fruhwirth 2010) for
more details on the model and its connection to multinomial logistic regression. (By defining
$\mathbf{w} = [\mathbf{w}_1, \ldots, \mathbf{w}_C]$, and $\mathbf{x}_{ic} = [\mathbf{0}, \ldots, \mathbf{0}, \mathbf{x}_i, \mathbf{0}, \ldots, \mathbf{0}]$, we can recover the more familiar
formulation $z_{ic} = \mathbf{x}_i^T \mathbf{w}_c$.) Since only relative utilities matter, we constrain $\mathbf{R}$ to be a correlation
matrix. If instead of setting $y_i = \arg\max_c z_{ic}$ we use $y_{ic} = I(z_{ic} > 0)$, we get a model known
as multivariate probit, which is one way to model C correlated binary outcomes (see e.g.,
(Talhouk et al. 2011)).


3. Note that the assumption that the Gaussian noise term is zero mean and unit variance is made without loss of
generality. To see why, suppose we used some other mean $\mu$ and variance $\sigma^2$. Then we could easily rescale $\mathbf{w}$ and add
an offset term without changing the likelihood, since $P(\mathcal{N}(0, 1) \ge -\mathbf{w}^T \mathbf{x}) = P(\mathcal{N}(\mu, \sigma^2) \ge -(\mathbf{w}^T \mathbf{x} + \mu)/\sigma)$.
Multi-task learning


Sometimes we want to fit many related classification or regression models. It is often reasonable 
to assume the input-output mapping is similar across these different models, so we can get 
better performance by fitting all the parameters at the same time. In machine learning, this 
setup is often called multi-task learning (Caruana 1998), transfer learning (e.g., (Raina et al. 
2005)), or learning to learn (Thrun and Pratt 1997). In statistics, this is usually tackled using 
hierarchical Bayesian models (Bakker and Heskes 2003), as we discuss below, although there are 
other possible methods (see e.g., (Chai 2010)). 


Hierarchical Bayes for multi-task learning 


Let $y_{ij}$ be the response of the i'th item in group j, for $i = 1:N_j$ and $j = 1:J$. For example,
j might index schools, i might index students within a school, and $y_{ij}$ might be the test score,
as in Section 5.6.2. Or j might index people, and i might index purchases, and $y_{ij}$ might be
the identity of the item that was purchased (this is known as discrete choice modeling (Train
2009)). Let $\mathbf{x}_{ij}$ be a feature vector associated with $y_{ij}$. The goal is to fit the models $p(\mathbf{y}_j|\mathbf{x}_j)$
for all j.

Although some groups may have lots of data, there is often a long tail, where the majority
of groups have little data. Thus we can't reliably fit each model separately, but we don't want
to use the same model for all groups. As a compromise, we can fit a separate model for
each group, but encourage the model parameters to be similar across groups. More precisely,
suppose $\mathbb{E}[y_{ij}|\mathbf{x}_{ij}] = g(\mathbf{x}_{ij}^T \boldsymbol{\beta}_j)$, where g is the link function for the GLM. Furthermore, suppose
$\boldsymbol{\beta}_j \sim \mathcal{N}(\boldsymbol{\beta}_*, \sigma_j^2 \mathbf{I})$, and that $\boldsymbol{\beta}_* \sim \mathcal{N}(\boldsymbol{\mu}, \sigma_*^2 \mathbf{I})$. In this model, groups with small sample
size borrow statistical strength from the groups with larger sample size, because the $\boldsymbol{\beta}_j$'s are
correlated via the latent common parents $\boldsymbol{\beta}_*$ (see Section 5.5 for further discussion of this point).
The term $\sigma_j^2$ controls how much group j depends on the common parents and the $\sigma_*^2$ term
controls the strength of the overall prior.

Suppose, for simplicity, that $\boldsymbol{\mu} = \mathbf{0}$, and that $\sigma_j^2$ and $\sigma_*^2$ are all known (e.g., they could be set
by cross validation). The overall log probability has the form

$$\log p(\mathcal{D}|\boldsymbol{\beta}) + \log p(\boldsymbol{\beta}) = \sum_j \left[\log p(\mathcal{D}_j|\boldsymbol{\beta}_j) - \frac{\|\boldsymbol{\beta}_j - \boldsymbol{\beta}_*\|^2}{2\sigma_j^2}\right] - \frac{\|\boldsymbol{\beta}_*\|^2}{2\sigma_*^2} \tag{9.110}$$

We can perform MAP estimation of $\boldsymbol{\beta} = (\boldsymbol{\beta}_{1:J}, \boldsymbol{\beta}_*)$ using standard gradient methods. Alternatively,
we can perform an iterative optimization scheme, alternating between optimizing the
$\boldsymbol{\beta}_j$ and the $\boldsymbol{\beta}_*$; since the likelihood and prior are convex, this is guaranteed to converge to the
global optimum. Note that once the models are trained, we can discard $\boldsymbol{\beta}_*$, and use each model
separately.
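
The alternating scheme can be sketched in the simplest possible setting, where each group has a scalar parameter and a unit-variance Gaussian likelihood, $y_{ij} \sim \mathcal{N}(\beta_j, 1)$. This likelihood, the shared value of $\sigma_j^2$, the function name, and the data are assumptions chosen so that both updates of Equation 9.110 have closed forms; they are not taken from the text.

```python
def fit_hierarchical_means(groups, sigma2_j=1.0, sigma2_star=100.0, iters=200):
    """Coordinate ascent on Equation 9.110 for y_ij ~ N(beta_j, 1).

    The intercept-only Gaussian likelihood is an illustrative assumption.
    """
    beta_star = 0.0
    betas = [0.0] * len(groups)
    for _ in range(iters):
        # Update each group's parameter with beta_star held fixed.
        betas = [(sum(ys) + beta_star / sigma2_j) / (len(ys) + 1.0 / sigma2_j)
                 for ys in groups]
        # Update the shared parent with the group parameters held fixed.
        beta_star = (sum(b / sigma2_j for b in betas)
                     / (len(betas) / sigma2_j + 1.0 / sigma2_star))
    return betas, beta_star

# Hypothetical data: two large groups and one group with a single observation.
groups = [[5.1, 4.9, 5.0, 5.2, 4.8] * 4, [6.0, 6.2, 5.8, 6.1, 5.9] * 4, [9.0]]
betas, beta_star = fit_hierarchical_means(groups)
```

The estimate for the single-observation group is pulled strongly toward the shared parent, while the two large groups stay close to their own sample means, which is the borrowing of statistical strength described above.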


Application to personalized email spam filtering 


An interesting application of multi-task learning is personalized spam filtering. Suppose we
want to fit one classifier per user, $\boldsymbol{\beta}_j$. Since most users do not label their email as spam or not,
it will be hard to estimate these models independently. So we will let the $\boldsymbol{\beta}_j$ have a common
prior $\boldsymbol{\beta}_*$, representing the parameters of a generic user.

In this case, we can emulate the behavior of the above model with a simple trick (Daume
2007b; Attenberg et al. 2009; Weinberger et al. 2009): we make two copies of each feature $\mathbf{x}_i$,
one concatenated with the user id, and one not. The effect will be to learn a predictor of the
form

$$\mathbb{E}[y_i|\mathbf{x}_i, u] = (\boldsymbol{\beta}_*, \mathbf{w}_1, \ldots, \mathbf{w}_J)^T [\mathbf{x}_i, I(u=1)\mathbf{x}_i, \ldots, I(u=J)\mathbf{x}_i] \tag{9.111}$$

where u is the user id. In other words,

$$\mathbb{E}[y_i|\mathbf{x}_i, u = j] = (\boldsymbol{\beta}_* + \mathbf{w}_j)^T \mathbf{x}_i \tag{9.112}$$

Thus $\boldsymbol{\beta}_*$ will be estimated from everyone's email, whereas $\mathbf{w}_j$ will just be estimated from user
j's email.

To see the correspondence with the above hierarchical Bayesian model, define $\mathbf{w}_j = \boldsymbol{\beta}_j - \boldsymbol{\beta}_*$.
Then the log probability of the original model can be rewritten as

$$\sum_j \left[\log p(\mathcal{D}_j|\boldsymbol{\beta}_* + \mathbf{w}_j) - \frac{\|\mathbf{w}_j\|^2}{2\sigma_j^2}\right] - \frac{\|\boldsymbol{\beta}_*\|^2}{2\sigma_*^2} \tag{9.113}$$

If we assume $\sigma_j^2 = \sigma_*^2$, the effect is the same as using the augmented feature trick, with the
same regularizer strength for both $\mathbf{w}_j$ and $\boldsymbol{\beta}_*$. However, one typically gets better performance
by not requiring that $\sigma_j^2$ be equal to $\sigma_*^2$ (Finkel and Manning 2009).

Application to domain adaptation 


Domain adaptation is the problem of training a set of classifiers on data drawn from different 
distributions, such as email and newswire text. This problem is obviously a special case of 
multi-task learning, where the tasks are the same. 

(Finkel and Manning 2009) used the above hierarchical Bayesian model to perform domain
adaptation for two NLP tasks, namely named entity recognition and parsing. They report
reasonably large improvements over fitting separate models to each dataset, and small improvements
over the approach of pooling all the data and fitting a single model.


Other kinds of prior 


In multi-task learning, it is common to assume that the prior is Gaussian. However, sometimes
other priors are more suitable. For example, consider the task of conjoint analysis, which
requires figuring out which features of a product customers like best. This can be modelled
using the same hierarchical Bayesian setup as above, but where we use a sparsity-promoting
prior on $\boldsymbol{\beta}_j$, rather than a Gaussian prior. This is called multi-task feature selection. See e.g.,
(Lenk et al. 1996; Argyriou et al. 2008) for some possible approaches.

It is not always reasonable to assume that all tasks are equally similar. If we pool the
parameters across tasks that are qualitatively different, the performance will be worse than not
using pooling, because the inductive bias of our prior is wrong. Indeed, it has been found
experimentally that sometimes multi-task learning does worse than solving each task separately
(this is called negative transfer).

One way around this problem is to use a more flexible prior, such as a mixture of Gaussians.
Such flexible priors can provide robustness against prior mis-specification. See e.g., (Xue et al.
2007; Jacob et al. 2008) for details. One can of course combine mixtures with sparsity-promoting
priors (Ji et al. 2009). Many other variants are possible.


Generalized linear mixed models * 


Suppose we generalize the multi-task learning scenario to allow the response to include
information at the group level, $\mathbf{x}_j$, as well as at the item level, $\mathbf{x}_{ij}$. Similarly, we can allow the
parameters to vary across groups, $\beta_j$, or to be tied across groups, $\alpha$. This gives rise to the
following model:

$$
\mathbb{E}[y_{ij} | \mathbf{x}_{ij}, \mathbf{x}_j] = g\left(\phi_1(\mathbf{x}_{ij})^T \beta_j + \phi_2(\mathbf{x}_j)^T \beta'_j + \phi_3(\mathbf{x}_{ij})^T \alpha + \phi_4(\mathbf{x}_j)^T \alpha'\right) \tag{9.114}
$$

where the $\phi_k$ are basis functions. This model can be represented pictorially as shown in
Figure 9.2(a). (Such figures will be explained in Chapter 10.) Note that the number of $\beta_j$
parameters grows with the number of groups, whereas the size of $\alpha$ is fixed.

Frequentists call the terms $\beta_j$ random effects, since they vary randomly across groups, but
they call $\alpha$ a fixed effect, since it is viewed as a fixed but unknown constant. A model with
both fixed and random effects is called a mixed model. If p(y|x) is a GLM, the overall model 
is called a generalized linear mixed effects model or GLMM. Such models are widely used in 
statistics. 




Example: semi-parametric GLMMs for medical data 


Consider the following example from (Wand 2009). Suppose $y_{ij}$ is the amount of spinal bone
mineral density (SBMD) for person $j$ at measurement $i$. Let $x_{ij}$ be the age of the person at
that measurement, and let $x_j$ be their ethnicity, which can be one of: White, Asian, Black, or
Hispanic. The primary goal is to determine if there are significant differences in the mean SBMD
among the four ethnic groups, after accounting for age. The data is shown in the light gray lines
in Figure 9.2(b). We see that there is a nonlinear effect of SBMD vs age, so we will use a
semi-parametric model which combines linear regression with non-parametric regression
(Ruppert et al. 2003). We also see that there is variation across individuals within each group,
so we will use a mixed effects model. Specifically, we will use $\phi_1(x_{ij}) = 1$ to account for the
random effect of each person; $\phi_2(x_j) = 0$ since no other coefficients are person-specific;
$\phi_3(x_{ij}) = [b_k(x_{ij})]$, where $b_k$ is the $k$'th spline basis function (see Section 15.4.6.2), to
account for the nonlinear effect of age; and
$\phi_4(x_j) = [\mathbb{I}(x_j = w), \mathbb{I}(x_j = a), \mathbb{I}(x_j = b), \mathbb{I}(x_j = h)]$ to account for the effect of the
different ethnicities. Furthermore, we use a linear link function. The overall model is therefore

$$
\begin{aligned}
\mathbb{E}[y_{ij} | x_{ij}, x_j] &= \beta_j + \alpha^T \mathbf{b}(x_{ij}) + \epsilon_{ij} \qquad (9.115) \\
&\quad + \alpha'_w \mathbb{I}(x_j = w) + \alpha'_a \mathbb{I}(x_j = a) + \alpha'_b \mathbb{I}(x_j = b) + \alpha'_h \mathbb{I}(x_j = h) \qquad (9.116)
\end{aligned}
$$


where $\epsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$. $\alpha$ contains the non-parametric part of the model related to age, $\alpha'$
contains the parametric part of the model related to ethnicity, and $\beta_j$ is a random offset
for person $j$. We endow all of these regression coefficients with separate Gaussian priors.
We can then perform posterior inference to compute $p(\alpha, \alpha', \beta, \sigma^2 | \mathcal{D})$ (see Section 9.6.2 for
computational details). After fitting the model, we can compute the prediction for each group.
See Figure 9.2(b) for the results. We can also perform significance testing, by computing
$p(\alpha'_g - \alpha'_w | \mathcal{D})$ for each ethnic group $g$ relative to some baseline (say, White), as we did in Section 5.2.3.

Figure 9.2 (a) Directed graphical model for generalized linear mixed effects model with J groups. (b)
Spinal bone mineral density vs age for four different ethnic groups. Raw data is shown in the light gray
lines. Fitted model shown in black (solid is the posterior predicted mean, dotted is the posterior predictive
variance). From Figure 9 of (Wand 2009). Used with kind permission of Matt Wand.


Computational issues 


The principal problem with GLMMs is that they can be difficult to fit, for two reasons. First,
$p(y_{ij}|\theta)$ may not be conjugate to the prior $p(\theta)$ where $\theta = (\alpha, \beta)$. Second, there are two levels
of unknowns in the model, namely the regression coefficients $\theta$ and the means and variances
of the priors $\eta = (\mu, \sigma)$.

One approach is to adopt fully Bayesian inference methods, such as variational Bayes (Hall 
et al. 2011) or MCMC (Gelman and Hill 2007). We discuss VB in Section 21.5, and MCMC in 
Section 24.1. 

An alternative approach is to use empirical Bayes, which we discuss in general terms in 
Section 5.6. In the context of a GLMM, we can use the EM algorithm (Section 11.4), where in the 
E step we compute $p(\theta | \eta, \mathcal{D})$, and in the M step we optimize $\eta$. In the linear regression setting,
the E step can be performed exactly, but in general we need to use approximations. Traditional 
methods use numerical quadrature or Monte Carlo (see e.g., (Breslow and Clayton 1993)). A 
faster approach is to use variational EM; see (Braun and McAuliffe 2010) for an application of 
variational EM to a multi-level discrete choice modeling problem. 

In frequentist statistics, there is a popular method for fitting GLMMs called generalized 
estimating equations or GEE (Hardin and Hilbe 2003). However, we do not recommend this 
approach, since it is not as statistically efficient as likelihood-based methods (see Section 6.4.3). 
In addition, it can only provide estimates of the population parameters $\alpha$, but not the random
effects $\beta_j$, which are sometimes of interest in themselves.


Learning to rank * 


In this section, we discuss the learning to rank or LETOR problem. That is, we want to learn a 
function that can rank order a set of items (we will be more precise below). The most common 
application is to information retrieval. Specifically, suppose we have a query q and a set of 
documents $d^1, \ldots, d^m$ that might be relevant to $q$ (e.g., all documents that contain the string $q$).
We would like to sort these documents in decreasing order of relevance and show the top k to 
the user. Similar problems arise in other areas, such as collaborative filtering. (Ranking players 
in a game or tournament setting is a slightly different kind of problem; see Section 22.5.5.) 

Below we summarize some methods for solving this problem, following the presentation of 
(Liu 2009). This material is not based on GLMs, but we include it in this chapter anyway for 
lack of a better place. 

A standard way to measure the relevance of a document d to a query q is to use a probabilistic 
language model based on a bag of words model. That is, we define
$\mathrm{sim}(q, d) \triangleq p(q|d) = \prod_{i=1}^n p(q_i|d)$, where $q_i$ is the $i$th word or term, and $p(q_i|d)$ is a multinoulli distribution
estimated from document $d$. In practice, we need to smooth the estimated distribution, for
example by using a Dirichlet prior, representing the overall frequency of each word. This can be
estimated from all documents in the system. More precisely, we can use

$$
p(t|d) = (1 - \lambda) \frac{\mathrm{TF}(t, d)}{\mathrm{LEN}(d)} + \lambda\, p(t|\text{background}) \tag{9.117}
$$

where $\mathrm{TF}(t, d)$ is the frequency of term $t$ in document $d$, $\mathrm{LEN}(d)$ is the number of words in $d$,
and $0 < \lambda < 1$ is a smoothing parameter (see e.g., Zhai and Lafferty (2004) for details).
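
The smoothed language model of Equation 9.117 can be sketched in a few lines of Python. This is only an illustration: the toy documents, the function names, and the choice $\lambda = 0.5$ are assumptions made here, and the background distribution is simply estimated from all the toy documents pooled together.

```python
import math
from collections import Counter

def smoothed_term_prob(term, doc_tokens, background_tokens, lam=0.5):
    """p(t|d) = (1 - lam) * TF(t,d)/LEN(d) + lam * p(t|background), as in Eq. 9.117."""
    p_doc = Counter(doc_tokens)[term] / len(doc_tokens)
    p_bg = Counter(background_tokens)[term] / len(background_tokens)
    return (1 - lam) * p_doc + lam * p_bg

def log_query_likelihood(query_tokens, doc_tokens, background_tokens, lam=0.5):
    """log sim(q,d) = sum_i log p(q_i|d) under the bag-of-words assumption.

    Assumes every query term occurs somewhere in the background collection;
    otherwise the probability is zero and math.log raises ValueError.
    """
    return sum(
        math.log(smoothed_term_prob(t, doc_tokens, background_tokens, lam))
        for t in query_tokens
    )

# Hypothetical toy collection, used only for illustration.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
}
background = [t for toks in docs.values() for t in toks]
query = "dog cat".split()

scores = {name: log_query_likelihood(query, toks, background)
          for name, toks in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because of the background term, document d1 still receives a non-zero probability for the query word "dog" even though the word never occurs in it; without smoothing its score would be $\log 0$.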

However, there might be many other signals that we can use to measure relevance. For 
example, the PageRank of a web document is a measure of its authoritativeness, derived from 
the web’s link structure (see Section 17.2.4 for details). We can also compute how often and 
where the query occurs in the document. Below we discuss how to learn how to combine all 
these signals.$^4$


The pointwise approach 


Suppose we collect some training data representing the relevance of a set of documents for each 
query. Specifically, for each query $q$, suppose that we retrieve $m$ possibly relevant documents
$d_j$, for $j = 1:m$. For each query document pair, we define a feature vector, $\mathbf{x}(q, d)$. For
example, this might contain the query-document similarity score and the PageRank score of the
document. Furthermore, suppose we have a set of labels $y_j$ representing the degree of relevance
of document $d_j$ to query $q$. Such labels might be binary (e.g., relevant or irrelevant), or they may
represent a degree of relevance (e.g., very relevant, somewhat relevant, irrelevant). Such labels 
can be obtained from query logs, by thresholding the number of times a document was clicked 
on for a given query. 

If we have binary relevance labels, we can solve the problem using a standard binary classification
scheme to estimate $p(y = 1|\mathbf{x}(q, d))$. If we have ordered relevancy labels, we can
use ordinal regression to predict the rating, p(y = r|x(q,d)). In either case, we can then sort 
the documents by this scoring metric. This is called the pointwise approach to LETOR, and 
is widely used because of its simplicity. However, this method does not take into account the 
location of each document in the list. Thus it penalizes errors at the end of the list just as much 
as errors at the beginning, which is often not the desired behavior. In addition, each decision 
about relevance is made very myopically. 


The pairwise approach 


There is evidence (e.g., (Carterette et al. 2008)) that people are better at judging the relative 
relevance of two items rather than absolute relevance. Consequently, the data might tell us 
that $d_j$ is more relevant than $d_k$ for a given query, or vice versa. We can model this kind of
data using a binary classifier of the form $p(y_{jk} | \mathbf{x}(q, d_j), \mathbf{x}(q, d_k))$, where we set $y_{jk} = 1$ if
$\mathrm{rel}(d_j, q) > \mathrm{rel}(d_k, q)$ and $y_{jk} = 0$ otherwise.

One way to model such a function is as follows: 


$$
p(y_{jk} = 1 | \mathbf{x}_j, \mathbf{x}_k) = \mathrm{sigm}(f(\mathbf{x}_j) - f(\mathbf{x}_k)) \tag{9.118}
$$

where $f(\mathbf{x})$ is a scoring function, often taken to be linear, $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$. This is a special
kind of neural network known as RankNet (Burges et al. 2005) (see Section 16.5 for a general
discussion of neural networks). We can find the MLE of $\mathbf{w}$ by maximizing the log likelihood, or
equivalently, by minimizing the cross entropy loss, given by

$$
\begin{aligned}
L &= \sum_{i=1}^N \sum_{j=1}^{m_i} \sum_{k=j+1}^{m_i} L_{ijk} \qquad (9.119) \\
-L_{ijk} &= \mathbb{I}(y_{ijk} = 1) \log p(y_{ijk} = 1 | \mathbf{x}_{ij}, \mathbf{x}_{ik}, \mathbf{w}) \\
&\quad + \mathbb{I}(y_{ijk} = 0) \log p(y_{ijk} = 0 | \mathbf{x}_{ij}, \mathbf{x}_{ik}, \mathbf{w}) \qquad (9.120)
\end{aligned}
$$

This can be optimized using gradient descent. A variant of RankNet is used by Microsoft's Bing
search engine.$^5$

4. Rather surprisingly, Google does not (or at least, did not as of 2008) use such learning methods in its search engine.
Source: Peter Norvig, quoted in http://anand.typepad.com/datawocky/2008/05/are-human-experts-less-prone-to-catastrophic-errors-than-machine-learned-models.html.
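
As a rough illustration of Equations 9.118 to 9.120, the following Python sketch computes the pairwise cross entropy loss and its gradient for a single query with a linear scoring function, and then runs plain gradient descent. The synthetic features, the step size of 0.05, and the function name `ranknet_loss_and_grad` are assumptions made for this example; it is not the implementation of (Burges et al. 2005).

```python
import numpy as np

def ranknet_loss_and_grad(w, X, rel):
    """Pairwise cross entropy for one query, with f(x) = w^T x.

    X: (m, D) feature matrix, rel: (m,) relevance values.
    For each pair j < k, y_jk = 1 if rel[j] > rel[k], else 0 (ties count as 0).
    """
    m = X.shape[0]
    loss, grad = 0.0, np.zeros_like(w)
    for j in range(m):
        for k in range(j + 1, m):
            diff = X[j] - X[k]
            a = w @ diff                      # f(x_j) - f(x_k)
            p = 1.0 / (1.0 + np.exp(-a))      # sigm(a) = p(y_jk = 1)
            y = 1.0 if rel[j] > rel[k] else 0.0
            # -log sigm(a) = log(1 + e^{-a}); -log(1 - sigm(a)) = log(1 + e^{a})
            loss += np.logaddexp(0.0, -a) if y == 1.0 else np.logaddexp(0.0, a)
            grad += (p - y) * diff
    return loss, grad

# Synthetic data (hypothetical): relevance generated by a hidden linear score.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
rel = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
initial_loss, _ = ranknet_loss_and_grad(w, X, rel)
for _ in range(200):
    _, grad = ranknet_loss_and_grad(w, X, rel)
    w -= 0.05 * grad
final_loss, _ = ranknet_loss_and_grad(w, X, rel)
print(initial_loss, final_loss)
```

At $\mathbf{w} = \mathbf{0}$ every pair has probability 1/2, so the initial loss is $15 \log 2$ for the 15 pairs formed from 6 documents; the loss then decreases as the learned scores start to order the pairs correctly.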


The listwise approach 


The pairwise approach suffers from the problem that decisions about relevance are made just 
based on a pair of items (documents), rather than considering the full context. We now consider 
methods that look at the entire list of items at the same time. 

We can define a total order on a list by specifying a permutation of its indices, $\pi$. To model
our uncertainty about $\pi$, we can use the Plackett-Luce distribution, which derives its name
from independent work by (Plackett 1975) and (Luce 1959). This has the following form:

$$
p(\pi | \mathbf{s}) = \prod_{j=1}^m \frac{s_j}{\sum_{u=j}^m s_u} \tag{9.121}
$$

where $s_j = s(\pi^{-1}(j))$ is the score of the document ranked at the $j$'th position.

To understand Equation 9.121, let us consider a simple example. Suppose $\pi = (A, B, C)$.
Then we have that $p(\pi)$ is the probability of $A$ being ranked first, times the probability of $B$
being ranked second given that $A$ is ranked first, times the probability of $C$ being ranked third
given that $A$ and $B$ are ranked first and second. In other words,

$$
p(\pi | \mathbf{s}) = \frac{s_A}{s_A + s_B + s_C} \times \frac{s_B}{s_B + s_C} \times \frac{s_C}{s_C} \tag{9.122}
$$
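
The sequential structure of the Plackett-Luce distribution is easy to check numerically. The sketch below is a minimal illustration, assuming strictly positive scores stored in a dictionary; the score values and the function name are made up for the example.

```python
from itertools import permutations

def plackett_luce_prob(ordering, scores):
    """Probability of `ordering` (items listed from first place to last place).

    Each factor is the score of the item placed next, divided by the total
    score of the items that have not been placed yet (Equation 9.121).
    Assumes all scores are strictly positive.
    """
    prob = 1.0
    remaining = sum(scores[item] for item in ordering)
    for item in ordering:
        prob *= scores[item] / remaining
        remaining -= scores[item]
    return prob

scores = {"A": 3.0, "B": 2.0, "C": 1.0}   # hypothetical scores
p_abc = plackett_luce_prob(("A", "B", "C"), scores)   # (3/6) * (2/3) * (1/1)
total = sum(plackett_luce_prob(perm, scores) for perm in permutations(scores))
print(p_abc, total)
```

Summing over all $3! = 6$ orderings gives 1, which confirms that Equation 9.121 defines a proper distribution over permutations.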

To incorporate features, we can define $s(d) = f(\mathbf{x}(q, d))$, where we often take $f$ to be a
linear function, $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$. This is known as the ListNet model (Cao et al. 2007). To train
this model, let $\mathbf{y}_i$ be the relevance scores of the documents for query $i$. We then minimize the
cross entropy term

$$
-\sum_i \sum_{\pi} p(\pi | \mathbf{y}_i) \log p(\pi | \mathbf{s}_i) \tag{9.123}
$$


Of course, as stated, this is intractable, since the $i$'th term needs to sum over $m_i!$ permutations.
To make this tractable, we can consider permutations over the top $k$ positions only:

$$
p(\pi_{1:k} | \mathbf{s}_{1:m}) = \prod_{j=1}^k \frac{s_j}{\sum_{u=j}^m s_u} \tag{9.124}
$$

There are only $m!/(m - k)!$ such permutations. If we set $k = 1$, we can evaluate each cross
entropy term (and its derivative) in $O(m)$ time.

5. Source: http://www.bing.com/community/site_blogs/b/search/archive/2009/06/01/user-needs-features-and-the-science-behind-bing.aspx.
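
To make the $k = 1$ case concrete, here is a small Python sketch of the top-one cross entropy between a target list and a model list. It assumes that the positive scores are obtained by exponentiating raw scores, $s_j = \exp(\cdot)$, which is one common way to keep them positive but is an assumption of this example rather than something fixed by the text; the function names are likewise hypothetical.

```python
import numpy as np

def top_one_probs(raw_scores):
    """Top-1 Plackett-Luce probabilities, p(item j ranked first) = s_j / sum_u s_u.

    Assumption: s_j = exp(raw_scores[j]). The maximum is subtracted first for
    numerical stability, which does not change the ratio.
    """
    z = np.asarray(raw_scores, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def top_one_cross_entropy(relevance, model_scores):
    """Cross entropy between the top-1 distributions of the labels and the model.

    Costs O(m) for a list of m documents, instead of summing over m! permutations.
    """
    p_target = top_one_probs(relevance)
    p_model = top_one_probs(model_scores)
    return float(-np.sum(p_target * np.log(p_model)))

relevance = [2.0, 0.0, 1.0]                                   # hypothetical labels
good = top_one_cross_entropy(relevance, [2.0, 0.0, 1.0])      # model agrees with labels
bad = top_one_cross_entropy(relevance, [0.0, 2.0, 1.0])       # model swaps best and worst
print(good, bad)
```

The loss is smallest when the model's top-one distribution matches the target's, so `good` is lower than `bad`.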

In the special case where only one document from the presented list is deemed relevant, say 
$y_i = c$, we can instead use multinomial logistic regression:

$$
p(y_i = c | \mathbf{x}) = \frac{\exp(s_c)}{\sum_{c'=1}^m \exp(s_{c'})} \tag{9.125}
$$


This often performs at least as well as ranking methods, at least in the context of collaborative 
filtering (Yang et al. 2011). 


Loss functions for ranking 


There are a variety of ways to measure the performance of a ranking system, which we summa- 
rize below. 


• Mean reciprocal rank (MRR). For a query $q$, let the rank position of its first relevant
document be denoted by $r(q)$. Then we define the reciprocal rank to be $1/r(q)$, and the mean
reciprocal rank to be its average over all queries. This is a very simple performance measure.

• Mean average precision (MAP). In the case of binary relevance labels, we can define the
precision at $k$ of some ordering as follows:

$$
\mathrm{P@}k(\pi) \triangleq \frac{\text{num. relevant documents in the top } k \text{ positions of } \pi}{k} \tag{9.126}
$$

We then define the average precision as follows:

$$
\mathrm{AP}(\pi) \triangleq \frac{\sum_k \mathrm{P@}k(\pi) \cdot I_k}{\text{num. relevant documents}} \tag{9.127}
$$

where $I_k$ is 1 iff document $k$ is relevant. For example, if we have the relevancy labels
$\mathbf{y} = (1, 0, 1, 0, 1)$, then the AP is $\frac{1}{3}\left(\frac{1}{1} + \frac{2}{3} + \frac{3}{5}\right) \approx 0.76$. Finally, we define the mean average
precision as the AP averaged over all queries.

• Normalized discounted cumulative gain (NDCG). Suppose the relevance labels have multiple
levels. We can define the discounted cumulative gain of the first $k$ items in an ordering
as follows:

$$
\mathrm{DCG@}k(\mathbf{r}) = r_1 + \sum_{i=2}^k \frac{r_i}{\log_2 i} \tag{9.128}
$$

where $r_i$ is the relevance of item $i$ and the $\log_2$ term is used to discount items later in
the list. Table 9.3 gives a simple numerical example. An alternative definition, that places
stronger emphasis on retrieving relevant documents, uses

$$
\mathrm{DCG@}k(\mathbf{r}) = \sum_{i=1}^k \frac{2^{r_i} - 1}{\log_2(1 + i)} \tag{9.129}
$$


The trouble with DCG is that it varies in magnitude just because the length of a returned
list may vary. It is therefore common to normalize this measure by the ideal DCG, which is
the DCG obtained by using the optimal ordering: $\mathrm{IDCG@}k(\mathbf{r}) = \max_{\pi} \mathrm{DCG@}k(\mathbf{r})$. This
can be easily computed by sorting $r_{1:m}$ and then computing DCG@k. Finally, we define
the normalized discounted cumulative gain or NDCG as DCG/IDCG. Table 9.3 gives a
simple numerical example. The NDCG can be averaged over queries to give a measure of
performance.

    i                1      2      3       4      5       6
    r_i              3      2      3       0      1       2
    log_2 i          0      1      1.59    2.0    2.32    2.59
    r_i / log_2 i    N/A    2      1.887   0      0.431   0.772

Table 9.3 Illustration of how to compute NDCG, from http://en.wikipedia.org/wiki/Discounted_cumulative_gain.
The value $r_i$ is the relevance score of the item in position $i$. From this, we see
that DCG@6 = 3 + (2 + 1.887 + 0 + 0.431 + 0.772) = 8.09. The maximum DCG is obtained using the
ordering with scores 3, 3, 2, 2, 1, 0. Hence the ideal DCG is 8.693, and so the normalized DCG is 8.09 /
8.693 = 0.9306.

• Rank correlation. We can measure the correlation between the ranked list, $\pi$, and the
relevance judgment, $\pi^*$, using a variety of methods. One approach, known as the (weighted)
Kendall's $\tau$ statistic, is defined in terms of the weighted pairwise inconsistency between the
two lists:

$$
\tau(\pi, \pi^*) = \frac{\sum_{u<v} w_{uv} \left[1 + \mathrm{sgn}(\pi_u - \pi_v)\, \mathrm{sgn}(\pi^*_u - \pi^*_v)\right]}{2 \sum_{u<v} w_{uv}} \tag{9.130}
$$

A variety of other measures are commonly used.
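
The average precision and NDCG measures in the list above can be reproduced with a short Python sketch. The function names are hypothetical, the DCG follows Equation 9.128, and the average precision assumes that every relevant document appears somewhere in the ranked list. The results agree with the worked AP example and with Table 9.3 up to the rounding used in the table.

```python
import math

def average_precision(labels):
    """AP for binary labels; labels[k-1] is 1 iff the document at rank k is relevant."""
    hits, total = 0, 0.0
    for k, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            total += hits / k        # precision at k, counted only at relevant ranks
    return total / hits if hits else 0.0

def dcg(rels, k=None):
    """DCG@k as in Equation 9.128: r_1 + sum_{i>=2} r_i / log2(i)."""
    rels = rels[:k] if k is not None else rels
    return sum(r if i == 1 else r / math.log2(i) for i, r in enumerate(rels, start=1))

def ndcg(rels, k=None):
    """DCG divided by the DCG of the ideal (decreasing relevance) ordering."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

ap = average_precision([1, 0, 1, 0, 1])   # (1/1 + 2/3 + 3/5) / 3
table_rels = [3, 2, 3, 0, 1, 2]           # the r_i row of Table 9.3
print(round(ap, 4), round(dcg(table_rels), 3), round(ndcg(table_rels), 4))
```

Without intermediate rounding, the DCG of the table's ordering is about 8.097 and the NDCG is about 0.9315, slightly different from the 0.9306 obtained from the rounded values 8.09 and 8.693.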


These loss functions can be used in different ways. In the Bayesian approach, we first fit the 
model using posterior inference; this depends on the likelihood and prior, but not the loss. We 
then choose our actions at test time to minimize the expected future loss. One way to do this is 
to sample parameters from the posterior, $\theta^s \sim p(\theta | \mathcal{D})$, and then evaluate, say, the precision@k
for different thresholds, averaging over $\theta^s$. See (Zhang et al. 2010) for an example of such an
approach. 

In the frequentist approach, we try to minimize the empirical loss on the training set. The 
problem is that these loss functions are not differentiable functions of the model parameters. 
We can either use gradient-free optimization methods, or we can minimize a surrogate loss 
function instead. Cross entropy loss (i.e., negative log likelihood) is an example of a widely used 
surrogate loss function. 

Another loss, known as weighted approximate-rank pairwise or WARP loss, proposed in 
(Usunier et al. 2009) and extended in (Weston et al. 2010), provides a better approximation to 
the precision@k loss. WARP is defined as follows: 


$$
\begin{aligned}
\mathrm{WARP}(\mathbf{f}(\mathbf{x}, :), y) &\triangleq L(\mathrm{rank}(\mathbf{f}(\mathbf{x}, :), y)) \qquad (9.131) \\
\mathrm{rank}(\mathbf{f}(\mathbf{x}, :), y) &= \sum_{y' \neq y} \mathbb{I}(f(\mathbf{x}, y') \geq f(\mathbf{x}, y)) \qquad (9.132) \\
L(k) &\triangleq \sum_{j=1}^k \alpha_j, \text{ with } \alpha_1 \geq \alpha_2 \geq \cdots \geq 0 \qquad (9.133)
\end{aligned}
$$

Here $\mathbf{f}(\mathbf{x}, :) = [f(\mathbf{x}, 1), \ldots, f(\mathbf{x}, |\mathcal{Y}|)]$ is the vector of scores for each possible output label,
or, in IR terms, for each possible document corresponding to input query $\mathbf{x}$. The expression
$\mathrm{rank}(\mathbf{f}(\mathbf{x}, :), y)$ measures the rank of the true label $y$ assigned by this scoring function. Finally,
$L$ transforms the integer rank into a real-valued penalty. Using $\alpha_1 = 1$ and $\alpha_{j>1} = 0$ would
optimize the proportion of top-ranked correct labels. Setting $\alpha_{1:k}$ to be non-zero values would
optimize the top $k$ in the ranked list, which will induce good performance as measured by
MAP or precision@k. As it stands, WARP loss is still hard to optimize, but it can be further
approximated by Monte Carlo sampling, and then optimized by gradient descent, as described
in (Weston et al. 2010).
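
The exact (unapproximated) WARP loss of Equations 9.131 to 9.133 is simple to evaluate for a given score vector. The sketch below only illustrates the definitions; it does not implement the sampling approximation of (Weston et al. 2010). The score values and the weights $\alpha_j = 1/j$ are hypothetical choices for the example.

```python
import numpy as np

def warp_rank(scores, y):
    """Number of incorrect labels y' != y that score at least as high as the true label y."""
    others = np.delete(scores, y)
    return int(np.sum(others >= scores[y]))

def warp_loss(scores, y, alpha):
    """L(rank) = alpha_1 + ... + alpha_rank, with alpha_1 >= alpha_2 >= ... >= 0."""
    k = warp_rank(scores, y)
    return float(np.sum(alpha[:k]))

scores = np.array([0.2, 1.5, 0.9, 0.4])    # f(x, :) for four candidate labels
alpha = 1.0 / np.arange(1, len(scores))    # hypothetical weights 1, 1/2, 1/3

for y in range(len(scores)):
    print(y, warp_rank(scores, y), warp_loss(scores, y, alpha))
```

With this definition a rank of 0 means the true label is scored highest, so the loss is zero exactly when the correct label is ranked at the top.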


Exercises 


Exercise 9.1 Conjugate prior for univariate Gaussian in exponential family form 


Derive the conjugate prior for $\mu$ and $\lambda = 1/\sigma^2$ for a univariate Gaussian using the exponential family,
by analogy to Section 9.2.5.5. By suitable reparameterization, show that the prior has the form
$p(\mu, \lambda) = \mathcal{N}(\mu | \gamma, \lambda(2\alpha - 1)) \mathrm{Ga}(\lambda | \alpha, \beta)$, and thus only has 3 free parameters.


Exercise 9.2 The MVN is in the exponential family 


Show that we can write the MVN in exponential family form. Hint: use the information form defined in 
Section 4.3.3. 


Directed graphical models (Bayes nets)


Introduction 


I basically know of two principles for treating complicated systems in simple ways: the 
first is the principle of modularity and the second is the principle of abstraction. I 
am an apologist for computational probability in machine learning because I believe that 
probability theory implements these two principles in deep and intriguing ways — namely 
through factorization and through averaging. Exploiting these two mechanisms as fully 
as possible seems to me to be the way forward in machine learning. — Michael Jordan, 
1997 (quoted in (Frey 1998)). 


Suppose we observe multiple correlated variables, such as words in a document, pixels in an 
image, or genes in a microarray. How can we compactly represent the joint distribution $p(\mathbf{x}|\theta)$?
How can we use this distribution to infer one set of variables given another in a reasonable 
amount of computation time? And how can we learn the parameters of this distribution with a 
reasonable amount of data? These questions are at the core of probabilistic modeling, inference 
and learning, and form the topic of this chapter. 


Chain rule 


By the chain rule of probability, we can always represent a joint distribution as follows, using 
any ordering of the variables: 


$$
p(\mathbf{x}_{1:V}) = p(x_1) p(x_2|x_1) p(x_3|x_2, x_1) p(x_4|x_1, x_2, x_3) \ldots p(x_V|\mathbf{x}_{1:V-1}) \tag{10.1}
$$

where $V$ is the number of variables, the Matlab-like notation $1:V$ denotes the set $\{1, 2, \ldots, V\}$,
and where we have dropped the conditioning on the fixed parameters $\theta$ for brevity. The problem
with this expression is that it becomes more and more complicated to represent the conditional
distributions $p(x_t | \mathbf{x}_{1:t-1})$ as $t$ gets large.

For example, suppose all the variables have $K$ states. We can represent $p(x_1)$ as a table
of $O(K)$ numbers, representing a discrete distribution (there are actually only $K - 1$ free
parameters, due to the sum-to-one constraint, but we write $O(K)$ for simplicity). Similarly, we
can represent $p(x_2|x_1)$ as a table of $O(K^2)$ numbers by writing $p(x_2 = j | x_1 = i) = T_{ij}$; we
say that $\mathbf{T}$ is a stochastic matrix, since it satisfies the constraint $\sum_j T_{ij} = 1$ for all rows $i$,
and $0 \leq T_{ij} \leq 1$ for all entries. Similarly, we can represent $p(x_3|x_1, x_2)$ as a 3d table with
$O(K^3)$ numbers. These are called conditional probability tables or CPTs. We see that there
are $O(K^V)$ parameters in the model. We would need an awful lot of data to learn so many
parameters.

One solution is to replace each CPT with a more parsimonious conditional probability distribution
or CPD, such as multinomial logistic regression, i.e., $p(x_t = k | \mathbf{x}_{1:t-1}) = \mathcal{S}(\mathbf{W}_t \mathbf{x}_{1:t-1})_k$.
The total number of parameters is now only $O(K^2 V^2)$, making this a compact density model
(Neal 1992; Frey 1998). This is adequate if all we want to do is evaluate the probability of a fully
observed vector $\mathbf{x}_{1:V}$. For example, we can use this model to define a class-conditional density,
$p(\mathbf{x}|y = c)$, thus making a generative classifier (Bengio and Bengio 2000). However, this model
is not useful for other kinds of prediction tasks, since each variable depends on all the previous
variables. So we need another approach.


Conditional independence 


The key to efficiently representing large joint distributions is to make some assumptions about 
conditional independence (CI). Recall from Section 2.2.4 that $X$ and $Y$ are conditionally
independent given $Z$, denoted $X \perp Y | Z$, if and only if (iff) the conditional joint can be written as
a product of conditional marginals, i.e.,

$$
X \perp Y | Z \iff p(X, Y | Z) = p(X|Z) p(Y|Z) \tag{10.2}
$$


Let us see why this might help. Suppose we assume that $x_{t+1} \perp \mathbf{x}_{1:t-1} | x_t$, or in words,
“the future is independent of the past given the present”. This is called the (first order) Markov
assumption. Using this assumption, plus the chain rule, we can write the joint distribution as
follows:

$$
p(\mathbf{x}_{1:V}) = p(x_1) \prod_{t=2}^V p(x_t | x_{t-1}) \tag{10.3}
$$

This is called a (first-order) Markov chain. Such a chain can be characterized by an initial distribution
over states, $p(x_1 = i)$, plus a state transition matrix $p(x_t = j | x_{t-1} = i)$. See Section 17.2 for
more information.
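
As a small numerical check of the Markov chain factorization, the following Python sketch evaluates the joint probability of a state sequence from an initial distribution and a transition matrix. The two-state chain and its numbers are made up for illustration, and states are indexed from 0 rather than 1.

```python
import itertools
import numpy as np

def markov_chain_log_prob(x, init, T):
    """log p(x_{1:V}) = log p(x_1) + sum_{t>=2} log p(x_t | x_{t-1})."""
    logp = np.log(init[x[0]])
    for prev, cur in zip(x[:-1], x[1:]):
        logp += np.log(T[prev, cur])
    return logp

init = np.array([0.6, 0.4])            # p(x_1 = i)
T = np.array([[0.9, 0.1],              # T[i, j] = p(x_t = j | x_{t-1} = i)
              [0.3, 0.7]])             # each row sums to one (stochastic matrix)

seq = [0, 0, 1, 1]
p = np.exp(markov_chain_log_prob(seq, init, T))   # 0.6 * 0.9 * 0.1 * 0.7

# Summing over all K^V = 2^4 sequences should give 1.
total = sum(np.exp(markov_chain_log_prob(list(s), init, T))
            for s in itertools.product(range(2), repeat=4))
print(p, total)
```

The model needs only $K - 1 + K(K - 1)$ free parameters regardless of the sequence length, compared with $K^V - 1$ for an unconstrained joint table.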


Graphical models 


Although the first-order Markov assumption is useful for defining distributions on 1d sequences, 
how can we define distributions on 2d images, or 3d videos, or, in general, arbitrary collections 
of variables (such as genes belonging to some biological pathway)? This is where graphical 
models come in. 

A graphical model (GM) is a way to represent a joint distribution by making CI assumptions. 
In particular, the nodes in the graph represent random variables, and the (lack of) edges represent 
CI assumptions. (A better name for these models would in fact be “independence diagrams”, 
but the term “graphical models” is now entrenched.) There are several kinds of graphical model, 
depending on whether the graph is directed, undirected, or some combination of directed and 
undirected. In this chapter, we just study directed graphs. We consider undirected graphs in 
Chapter 19. 
Figure 10.1 (a) A simple DAG on 5 nodes, numbered in topological order. Node 1 is the root, nodes 4 and
5 are the leaves. (b) A simple undirected graph, with the following maximal cliques: {1, 2, 3}, {2, 3, 4},
{3, 5}.


Graph terminology 


Before we continue, we must define a few basic terms, most of which are very intuitive. 

A graph $G = (\mathcal{V}, \mathcal{E})$ consists of a set of nodes or vertices, $\mathcal{V} = \{1, \ldots, V\}$, and a set
of edges, $\mathcal{E} = \{(s, t) : s, t \in \mathcal{V}\}$. We can represent the graph by its adjacency matrix, in
which we write $G(s, t) = 1$ to denote $(s, t) \in \mathcal{E}$, that is, if $s \to t$ is an edge in the graph.
If $G(s, t) = 1$ iff $G(t, s) = 1$, we say the graph is undirected, otherwise it is directed. We
usually assume $G(s, s) = 0$, which means there are no self loops.

Here are some other terms we will commonly use: 


• Parent For a directed graph, the parents of a node is the set of all nodes that feed into it:
$\mathrm{pa}(s) \triangleq \{t : G(t, s) = 1\}$.

• Child For a directed graph, the children of a node is the set of all nodes that feed out of it:
$\mathrm{ch}(s) \triangleq \{t : G(s, t) = 1\}$.

• Family For a directed graph, the family of a node is the node and its parents,
$\mathrm{fam}(s) = \{s\} \cup \mathrm{pa}(s)$.

• Root For a directed graph, a root is a node with no parents.

• Leaf For a directed graph, a leaf is a node with no children.

• Ancestors For a directed graph, the ancestors are the parents, grand-parents, etc of a node.
That is, the ancestors of $t$ is the set of nodes that connect to $t$ via a trail:
$\mathrm{anc}(t) \triangleq \{s : s \rightsquigarrow t\}$.

• Descendants For a directed graph, the descendants are the children, grand-children, etc of
a node. That is, the descendants of $s$ is the set of nodes that can be reached via trails from
$s$: $\mathrm{desc}(s) \triangleq \{t : s \rightsquigarrow t\}$.

• Neighbors For any graph, we define the neighbors of a node as the set of all immediately
connected nodes, $\mathrm{nbr}(s) \triangleq \{t : G(s, t) = 1 \vee G(t, s) = 1\}$. For an undirected graph, we
write $s \sim t$ to indicate that $s$ and $t$ are neighbors (so $(s, t) \in \mathcal{E}$ is an edge in the graph).

• Degree The degree of a node is the number of neighbors. For directed graphs, we speak of
the in-degree and out-degree, which count the number of parents and children.

• Cycle or loop For any graph, we define a cycle or loop to be a series of nodes such that
we can get back to where we started by following edges, $s_1 - s_2 \cdots - s_n - s_1$, $n \geq 2$. If the
graph is directed, we may speak of a directed cycle. For example, in Figure 10.1(a), there are
no directed cycles, but $1 - 2 - 4 - 3 - 1$ is an undirected cycle (one that ignores the edge directions).

• DAG A directed acyclic graph or DAG is a directed graph with no directed cycles. See
Figure 10.1(a) for an example.

• Topological ordering For a DAG, a topological ordering or total ordering is a numbering
of the nodes such that parents have lower numbers than their children. For example, in
Figure 10.1(a), we can use (1, 2, 3, 4, 5), or (1, 3, 2, 5, 4), etc.

• Path or trail A path or trail $s \rightsquigarrow t$ is a series of directed edges leading from $s$ to $t$.

• Tree An undirected tree is an undirected graph with no cycles. A directed tree is a DAG in
which there are no cycles even when the edge directions are ignored. If we allow a node to have
multiple parents, we call it a polytree, otherwise we call it a moral directed tree.

• Forest A forest is a set of trees.

• Subgraph A (node-induced) subgraph $G_A$ is the graph created by using the nodes in $A$ and
their corresponding edges, $G_A = (\mathcal{V}_A, \mathcal{E}_A)$.

• Clique For an undirected graph, a clique is a set of nodes that are all neighbors of each
other. A maximal clique is a clique which cannot be made any larger without losing the
clique property. For example, in Figure 10.1(b), {1, 2} is a clique but it is not maximal, since
we can add 3 and still maintain the clique property. In fact, the maximal cliques are as
follows: {1, 2, 3}, {2, 3, 4}, {3, 5}.
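
Several of the terms above can be illustrated with a small adjacency-matrix sketch in Python. The edge list is an assumption: it is read off the factorization that the text gives for the DAG of Figure 10.1(a) in Equation 10.6 (edges 1→2, 1→3, 2→4, 3→4, 3→5). The helper names are hypothetical, and the topological sort uses Kahn's algorithm, which is one standard choice rather than anything prescribed by the text.

```python
import numpy as np

V = 5
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)]   # assumed edges of Figure 10.1(a)

# Row and column 0 are left unused so that nodes keep their 1-based labels.
G = np.zeros((V + 1, V + 1), dtype=int)
for s, t in edges:
    G[s, t] = 1                                    # G(s, t) = 1 means s -> t

def parents(G, s):
    return {int(t) for t in np.flatnonzero(G[:, s])}

def children(G, s):
    return {int(t) for t in np.flatnonzero(G[s, :])}

def ancestors(G, t):
    """All nodes s with a directed trail s ~> t (parents, grand-parents, ...)."""
    found, stack = set(), list(parents(G, t))
    while stack:
        s = stack.pop()
        if s not in found:
            found.add(s)
            stack.extend(parents(G, s))
    return found

def topological_order(G):
    """Kahn's algorithm; raises ValueError if the graph has a directed cycle."""
    nodes = range(1, G.shape[0])
    indeg = {n: len(parents(G, n)) for n in nodes}
    ready = [n for n in nodes if indeg[n] == 0]    # the roots
    order = []
    while ready:
        n = ready.pop(0)
        order.append(n)
        for c in sorted(children(G, n)):
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    if len(order) != len(indeg):
        raise ValueError("graph has a directed cycle")
    return order

print(parents(G, 4), children(G, 3), ancestors(G, 4), topological_order(G))
```

For this graph the sort returns the ordering (1, 2, 3, 4, 5); other valid orderings such as (1, 3, 2, 5, 4) exist, since a topological ordering is generally not unique.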


Directed graphical models 


A directed graphical model or DGM is a GM whose graph is a DAG. These are more commonly 
known as Bayesian networks. However, there is nothing inherently “Bayesian” about Bayesian 
networks: they are just a way of defining probability distributions. These models are also called 
belief networks. The term “belief” here refers to subjective probability. Once again, there is 
nothing inherently subjective about the kinds of probability distributions represented by DGMs. 
Finally, these models are sometimes called causal networks, because the directed arrows are 
sometimes interpreted as representing causal relations. However, there is nothing inherently 
causal about DGMs. (See Section 26.6.1 for a discussion of causal DGMs.) For these reasons, we 
use the more neutral (but less glamorous) term DGM. 

The key property of DAGs is that the nodes can be ordered such that parents come before 
children. This is called a topological ordering, and it can be constructed from any DAG. Given 
such an order, we define the ordered Markov property to be the assumption that a node only 
depends on its immediate parents, not on all predecessors in the ordering, i.e., 


$$
x_s \perp \mathbf{x}_{\mathrm{pred}(s) \setminus \mathrm{pa}(s)} | \mathbf{x}_{\mathrm{pa}(s)} \tag{10.4}
$$

where pa(s) are the parents of node s, and pred(s) are the predecessors of node s in the
ordering. This is a natural generalization of the first-order Markov property from chains to
general DAGs.


Figure 10.2 (a) A naive Bayes classifier represented as a DGM. We assume there are D = 4 features, 
for simplicity. Shaded nodes are observed, unshaded nodes are hidden. (b) Tree-augmented naive Bayes 
classifier for D = 4 features. In general, the tree topology can change depending on the value of y. 


For example, the DAG in Figure 10.1(a) encodes the following joint distribution:

$$
\begin{aligned}
p(\mathbf{x}_{1:5}) &= p(x_1) p(x_2|x_1) p(x_3|x_1, x_2) p(x_4|x_1, x_2, x_3) p(x_5|x_1, x_2, x_3, x_4) \qquad (10.5) \\
&= p(x_1) p(x_2|x_1) p(x_3|x_1) p(x_4|x_2, x_3) p(x_5|x_3) \qquad (10.6)
\end{aligned}
$$

In general, we have

$$
p(\mathbf{x}_{1:V} | G) = \prod_{t=1}^V p(x_t | \mathbf{x}_{\mathrm{pa}(t)}) \tag{10.7}
$$

where each term $p(x_t | \mathbf{x}_{\mathrm{pa}(t)})$ is a CPD. We have written the distribution as $p(\mathbf{x}|G)$ to emphasize
that this equation only holds if the CI assumptions encoded in DAG $G$ are correct. However,
we will usually drop this explicit conditioning for brevity. If each node has $O(F)$ parents and
$K$ states, the number of parameters in the model is $O(V K^F)$, which is much less than the
$O(K^V)$ needed by a model which makes no CI assumptions.
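
The factorization in Equations 10.6 and 10.7 can be checked numerically. The Python sketch below builds random CPTs for the parent sets that Equation 10.6 implies for Figure 10.1(a), verifies that the product of CPDs sums to one, and counts free parameters. Binary variables ($K = 2$), the random tables, and all names are assumptions made for the example.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
K = 2                                              # number of states per node (assumed)
pa = {1: [], 2: [1], 3: [1], 4: [2, 3], 5: [3]}    # parent sets read off Equation 10.6

def random_cpt(num_parents):
    """Random table of shape (K,)*num_parents + (K,); the last axis sums to one."""
    table = rng.random((K,) * num_parents + (K,))
    return table / table.sum(axis=-1, keepdims=True)

cpts = {t: random_cpt(len(parents)) for t, parents in pa.items()}

def joint_prob(x):
    """x maps node -> state in {0, ..., K-1}; returns prod_t p(x_t | x_pa(t))."""
    p = 1.0
    for t, parents in pa.items():
        idx = tuple(x[s] for s in parents) + (x[t],)
        p *= cpts[t][idx]
    return p

# The joint must sum to one over all K^V configurations.
total = sum(joint_prob(dict(zip(pa, states)))
            for states in itertools.product(range(K), repeat=len(pa)))

# Free parameters: (K - 1) per parent configuration, versus K^V - 1 for a full table.
dgm_free_params = sum((K - 1) * K ** len(parents) for parents in pa.values())
full_free_params = K ** len(pa) - 1
print(total, dgm_free_params, full_free_params)
```

Even for this tiny example the DGM needs 11 free parameters instead of 31, and the gap grows exponentially with the number of nodes when the number of parents per node stays bounded.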


10.2 Examples 


In this section, we show that a wide variety of commonly used probabilistic models can be
conveniently represented as DGMs.


10.2.1 Naive Bayes classifiers 


In Section 3.5, we introduced the naive Bayes classifier. This assumes the features are
conditionally independent given the class label. This assumption is illustrated in Figure 10.2(a). This
allows us to write the joint distribution as follows:

$$
p(y, \mathbf{x}) = p(y) \prod_{j=1}^D p(x_j | y) \tag{10.8}
$$


The naive Bayes assumption is rather naive, since it assumes the features are conditionally
independent. One way to capture correlation between the features is to use a graphical model.
In particular, if the model is a tree, the method is known as a tree-augmented naive Bayes
classifier or TAN model (Friedman et al. 1997). This is illustrated in Figure 10.2(b). The reason
to use a tree, as opposed to a generic graph, is two-fold. First, it is easy to find the optimal
tree structure using the Chow-Liu algorithm, as explained in Section 26.3. Second, it is easy to
handle missing features in a tree-structured model, as we explain in Section 20.2.

Figure 10.3 A first and second order Markov chain.

Figure 10.4 A first-order HMM.


Markov and hidden Markov models 


Figure 10.3(a) illustrates a first-order Markov chain as a DAG. Of course, the assumption that the
immediate past, $x_{t-1}$, captures everything we need to know about the entire history, $\mathbf{x}_{1:t-2}$, is
a bit strong. We can relax it a little by adding a dependence from $x_{t-2}$ to $x_t$ as well; this is
called a second order Markov chain, and is illustrated in Figure 10.3(b). The corresponding
joint has the following form:

$$
p(\mathbf{x}_{1:T}) = p(x_1, x_2) p(x_3|x_1, x_2) p(x_4|x_2, x_3) \ldots = p(x_1, x_2) \prod_{t=3}^T p(x_t | x_{t-1}, x_{t-2}) \tag{10.9}
$$

We can create higher-order Markov models in a similar way. See Section 17.2 for a more detailed 
discussion of Markov models. 

Unfortunately, even the second-order Markov assumption may be inadequate if there are long- 
range correlations amongst the observations. We can't keep building ever higher order models, 
since the number of parameters will blow up. An alternative approach is to assume that there 
is an underlying hidden process, that can be modeled by a first-order Markov chain, but that 
the data is a noisy observation of this process. The result is known as a hidden Markov model 
or HMM, and is illustrated in Figure 10.4. Here $z_t$ is known as a hidden variable at “time” t,
and $x_t$ is the observed variable. (We put “time” in quotation marks, since these models can be
applied to any kind of sequence data, such as genomics or language, where t represents location
rather than time.) The CPD $p(z_t | z_{t-1})$ is the transition model, and the CPD $p(x_t | z_t)$ is the
observation model.
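
$h_0$  $h_1$  $h_2$ | $P(v = 0 | h_1, h_2)$        $P(v = 1 | h_1, h_2)$
1      0      0     | $\theta_0$                    $1 - \theta_0$
1      1      0     | $\theta_0 \theta_1$           $1 - \theta_0 \theta_1$
1      0      1     | $\theta_0 \theta_2$           $1 - \theta_0 \theta_2$
1      1      1     | $\theta_0 \theta_1 \theta_2$  $1 - \theta_0 \theta_1 \theta_2$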

Table 10.1 Noisy-OR CPD for 2 parents augmented with leak node. We have omitted the t subscript for 
brevity. 


The hidden variables often represent quantities of interest, such as the identity of the word 
that someone is currently speaking. The observed variables are what we measure, such as the 
acoustic waveform. What we would like to do is estimate the hidden state given the data, i.e., to
compute $p(z_t | \mathbf{x}_{1:t}, \boldsymbol{\theta})$. This is called state estimation, and is just another form of probabilistic
inference. See Chapter 17 for further details on HMMs.
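
As a rough illustration of state estimation, the following sketch computes the filtered beliefs $p(z_t | \mathbf{x}_{1:t})$ for a small discrete HMM using the standard forward recursion (predict with the transition model, multiply by the observation model, normalize). The recursion itself is not derived in this section, and the function name `forward_filter`, the initial distribution `pi`, and all the numbers are hypothetical choices made only for this example.

```python
import numpy as np

def forward_filter(pi, A, B, obs):
    """Return filtered[t, k] = p(z_t = k | x_{1:t}) for a discrete HMM.

    pi[k]   : p(z_1 = k), the initial state distribution (assumed given)
    A[j, k] : p(z_t = k | z_{t-1} = j), the transition model
    B[k, v] : p(x_t = v | z_t = k), the observation model
    obs     : sequence of observed symbol indices x_1, ..., x_T
    """
    filtered = np.zeros((len(obs), len(pi)))
    belief = pi * B[:, obs[0]]           # unnormalized p(z_1, x_1)
    filtered[0] = belief / belief.sum()
    for t in range(1, len(obs)):
        predicted = filtered[t - 1] @ A  # p(z_t | x_{1:t-1})
        belief = predicted * B[:, obs[t]]
        filtered[t] = belief / belief.sum()
    return filtered

# Hypothetical two-state, two-symbol example; the numbers are illustrative only.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.8, 0.2],
              [0.3, 0.7]])
print(forward_filter(pi, A, B, [0, 0, 1]))
```

Each row of the result is a distribution over the hidden state at one time step, conditioned only on the observations seen so far.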


10.2.3 Medical diagnosis


Consider modeling the relationship between various variables that are measured in an intensive 
care unit (ICU), such as the breathing rate of a patient, their blood pressure, etc. The alarm 
network in Figure 10.5(a) is one way to represent these (in)dependencies (Beinlich et al. 1989). 
This model has 37 variables and 504 parameters. 

Since this model was created by hand, by a process called knowledge engineering, it is 
known as a probabilistic expert system. In Section 10.4, we discuss how to learn the parameters 
of DGMs from data, assuming the graph structure is known, and in Chapter 26, we discuss how 
to learn the graph structure itself. 

A different kind of medical diagnosis network, known as the quick medical reference or 
QMR network (Shwe et al. 1991), is shown in Figure 10.5(b). This was designed to model infectious 
diseases. The QMR model is a bipartite graph structure, with diseases (causes) at the top and 
symptoms or findings at the bottom. All nodes are binary. We can write the distribution as 
follows: 


$$p(\mathbf{v}, \mathbf{h}) = \prod_s p(h_s) \prod_t p(v_t | \mathbf{h}_{pa(t)}) \qquad (10.10)$$


where $h_s$ represent the hidden nodes (diseases), and $v_t$ represent the visible nodes (symptoms).

The CPDs for the root nodes are just Bernoulli distributions, representing the prior probability
of that disease. Representing the CPDs for the leaves (symptoms) using CPTs would require
too many parameters, because the fan-in (number of parents) of many leaf nodes is very
high. A natural alternative is to use logistic regression to model the CPD,
$p(v_t = 1 | \mathbf{h}_{pa(t)}) = \mathrm{sigm}(\mathbf{w}_t^T \mathbf{h}_{pa(t)})$. (A DGM in which the CPDs are logistic regression distributions is known as a
sigmoid belief net (Neal 1992).) However, since the parameters of this model were created by
hand, an alternative CPD, known as the noisy-OR model, was used.

Figure 10.5 (a) The alarm network. Figure generated by visualizeAlarmNetwork. (b) The QMR
network.


$G^p$  $G^m$ | $p(X=a)$  $p(X=b)$  $p(X=o)$  $p(X=ab)$
a      a     | 1         0         0         0
a      b     | 0         0         0         1
a      o     | 1         0         0         0
b      a     | 0         0         0         1
b      b     | 0         1         0         0
b      o     | 0         1         0         0
o      a     | 1         0         0         0
o      b     | 0         1         0         0
o      o     | 0         0         1         0


Table 10.2 CPT which encodes a mapping from genotype to phenotype (bloodtype). This is a
deterministic, but many-to-one, mapping.


The noisy-OR model assumes that if a parent is on, then the child will usually also be on
(since it is an or-gate), but occasionally the “links” from parents to child may fail, independently
at random. In this case, even if the parent is on, the child may be off. To model this more
precisely, let $\theta_{st} = 1 - q_{st}$ be the probability that the $s \to t$ link fails, so
$q_{st} = 1 - \theta_{st} = p(v_t = 1 | h_s = 1, \mathbf{h}_{-s} = 0)$ is the probability that s can activate t on its own (its “causal power”). The
only way for the child to be off is if all the links from all parents that are on fail independently
at random. Thus


$$p(v_t = 0 | \mathbf{h}) = \prod_{s \in pa(t)} \theta_{st}^{\mathbb{I}(h_s = 1)} \qquad (10.11)$$


Obviously, $p(v_t = 1 | \mathbf{h}) = 1 - p(v_t = 0 | \mathbf{h})$.

If we observe that $v_t = 1$ but all its parents are off, then this contradicts the model. Such
a data case would get probability zero under the model, which is problematic, because it is
possible that someone exhibits a symptom but does not have any of the specified diseases. To
handle this, we add a dummy leak node $h_0$, which is always on; this represents “all other
causes”. The parameter $q_{0t}$ represents the probability that the background leak can cause the
effect on its own. The modified CPD becomes
$p(v_t = 0 | \mathbf{h}) = \theta_{0t} \prod_{s \in pa(t)} \theta_{st}^{\mathbb{I}(h_s = 1)}$. See Table 10.1
for a numerical example.

If we define $w_{st} = \log(\theta_{st})$, we can rewrite the CPD as


$$p(v_t = 1 | \mathbf{h}) = 1 - \exp\left( w_{0t} + \sum_s h_s w_{st} \right) \qquad (10.12)$$


We see that this is similar to a logistic regression model. 
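
To make the noisy-OR CPD concrete, here is a minimal sketch that evaluates it in two ways: directly as a product of link-failure probabilities (with a leak term), and through the log-weight form of Equation 10.12. The function names and the numerical values of the failure probabilities are hypothetical, chosen only for illustration.

```python
import math

def noisy_or_off_prob(theta, theta_leak, h):
    """p(v = 0 | h): every link from an 'on' parent, and the leak, must fail."""
    prob = theta_leak
    for theta_s, h_s in zip(theta, h):
        if h_s == 1:
            prob *= theta_s
    return prob

def noisy_or_on_prob_log_form(theta, theta_leak, h):
    """p(v = 1 | h) via Equation 10.12, with w = log(theta)."""
    w0 = math.log(theta_leak)
    w = [math.log(t) for t in theta]
    return 1.0 - math.exp(w0 + sum(h_s * w_s for h_s, w_s in zip(h, w)))

# Hypothetical failure probabilities for two parents plus a leak node.
theta, theta_leak = [0.1, 0.4], 0.99
for h in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    p_off = noisy_or_off_prob(theta, theta_leak, h)
    p_on = noisy_or_on_prob_log_form(theta, theta_leak, h)
    print(h, round(p_off, 4), round(p_on, 4))
```

For every parent configuration the two printed numbers sum to one, which is just the statement that the product form and the exponential form describe the same CPD.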

Bipartite models with noisy-OR CPDs are called BN2O models. It is relatively easy to set the
$\theta_{st}$ parameters by hand, based on domain expertise. However, it is also possible to learn them
from data (see e.g., (Neal 1992; Meek and Heckerman 1997)). Noisy-OR CPDs have also proved
useful in modeling human causal learning (Griffiths and Tenenbaum 2005), as well as general
binary classification settings (Yuille and Zheng 2009).


10.2.4 Genetic linkage analysis *


Another important (and historically very early) application of DGMs is to the problem of genetic 
linkage analysis. We start with a pedigree graph, which is a DAG that represents the
relationship between parents and children, as shown in Figure 10.6(a). We then convert this to a 
DGM, as we explain below. Finally we perform probabilistic inference in the resulting model. 

Figure 10.6 Left: family tree, circles are females, squares are males. Individuals with the disease of
interest are highlighted. Right: DGM for two loci. Blue nodes $X_{ij}$ are the observed phenotype for individual
i at locus j. All other nodes are hidden. Orange nodes $G_{ij}^{p/m}$ are the paternal/maternal alleles. Small
red nodes $z_{ij}^{p/m}$ are the paternal/maternal selection switching variables. These are linked across loci,
$z_{ij}^m \to z_{i,j+1}^m$ and $z_{ij}^p \to z_{i,j+1}^p$. The founder (root) nodes do not have any parents, and hence do not need
switching variables. Based on Figure 3 from (Friedman et al. 2000).

In more detail, for each person (or animal) i and location or locus j along the genome, we
create three nodes: the observed marker $X_{ij}$ (which can be a property such as blood type,
or just a fragment of DNA that can be measured), and two hidden alleles, $G_{ij}^m$ and $G_{ij}^p$, one
inherited from i’s mother (maternal allele) and the other from i’s father (paternal allele). Together,
the ordered pair $\mathbf{G}_{ij} = (G_{ij}^m, G_{ij}^p)$ constitutes i’s hidden genotype at locus j.

Obviously we must add $G_{ij}^m \to X_{ij}$ and $G_{ij}^p \to X_{ij}$ arcs representing the fact that genotypes
cause phenotypes (observed manifestations of genotypes). The CPD $p(X_{ij} | G_{ij}^m, G_{ij}^p)$ is called
the penetrance model. As a very simple example, suppose $X_{ij} \in \{A, B, O, AB\}$ represents
person i’s observed bloodtype, and $G_{ij}^m, G_{ij}^p \in \{A, B, O\}$ is their genotype. We can represent
the penetrance model using the deterministic CPD shown in Table 10.2. For example, A
dominates O, so if a person has genotype AO or OA, their phenotype will be A.

In addition, we add arcs from i’s mother and father into $\mathbf{G}_{ij}$, reflecting the Mendelian
inheritance of genetic material from one’s parents. More precisely, let $m_i = k$ be i’s mother.
Then $G_{ij}^m$ could either be equal to $G_{kj}^m$ or $G_{kj}^p$, that is, i’s maternal allele is a copy of one of its
mother’s two alleles. Let $Z_{ij}^m$ be a hidden variable that specifies the choice. We can model this
using the following CPD, known as the inheritance model:


$$p(G_{ij}^m | G_{kj}^m, G_{kj}^p, Z_{ij}^m) = \begin{cases} \mathbb{I}(G_{ij}^m = G_{kj}^m) & \text{if } Z_{ij}^m = m \\ \mathbb{I}(G_{ij}^m = G_{kj}^p) & \text{if } Z_{ij}^m = p \end{cases} \qquad (10.13)$$


We can define $p(G_{ij}^p | G_{kj}^m, G_{kj}^p, Z_{ij}^p)$ similarly, where $k = p_i$ is i’s father. The values of the $Z_{ij}$
are said to specify the phase of the genotype. The values of $G_{ij}^p$, $G_{ij}^m$, $Z_{ij}^p$ and $Z_{ij}^m$ constitute
the haplotype of person i at locus j.

Next, we need to specify the prior for the root nodes, $p(G_{ij}^m)$ and $p(G_{ij}^p)$. This is called
the founder model, and represents the overall prevalence of different kinds of alleles in the
population. We usually assume independence between the loci for these founder alleles.

Finally, we need to specify priors for the switch variables that control the inheritance process. 
These variables are spatially correlated, since adjacent sites on the genome are typically inherited 
together (recombination events are rare). We can model this by imposing a two-state Markov 
chain on the Z’s, where the probability of switching state at locus j is given by
$\theta_j = \frac{1}{2}(1 - e^{-2 d_j})$, where $d_j$ is the distance between loci j and j + 1. This is called the recombination
model.

The resulting DGM is shown in Figure 10.6(b): it is a series of replicated pedigree DAGs, 
augmented with switching Z variables, which are linked using Markov chains. (There is a 
related model known as phylogenetic HMM (Siepel and Haussler 2003), which is used to model 
evolution amongst phylogenies.) 

As a simplified example of how this model can be used, suppose we only have one locus,
corresponding to blood type. For brevity, we will drop the j index. Suppose we observe $x_i = A$.
Then there are 3 possible genotypes: $\mathbf{G}_i$ is (A, A), (A, O) or (O, A). There is ambiguity
because the genotype to phenotype mapping is many-to-one. We want to reverse this mapping.
This is known as an inverse problem. Fortunately, we can use the blood types of relatives to
help disambiguate the evidence. Information will “flow” from the other $x_i$’s up to their $\mathbf{G}_i$’s,
then across to i’s $\mathbf{G}_i$ via the pedigree DAG. Thus we can combine our local evidence $p(x_i | \mathbf{G}_i)$
with an informative prior, $p(\mathbf{G}_i | \mathbf{x}_{-i})$, conditioned on the other data, to get a less entropic local
posterior, $p(\mathbf{G}_i | \mathbf{x}) \propto p(x_i | \mathbf{G}_i) p(\mathbf{G}_i | \mathbf{x}_{-i})$.


1. Sometimes the observed marker is equal to the unphased genotype, which is the unordered set $\{G_{ij}^p, G_{ij}^m\}$; however,
the phased or hidden genotype is not directly measurable.
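
The single-locus blood type example can be checked by brute-force enumeration. The sketch below encodes the deterministic penetrance model of Table 10.2 and computes the local posterior over ordered genotypes given an observed phenotype. The uniform prior is purely an assumption standing in for $p(\mathbf{G}_i | \mathbf{x}_{-i})$, which in a real pedigree would come from inference over the relatives; the function names are likewise hypothetical.

```python
from itertools import product

ALLELES = ["A", "B", "O"]

def phenotype(g_m, g_p):
    """Deterministic penetrance model from Table 10.2 (A and B dominate O)."""
    pair = {g_m, g_p}
    if pair == {"A", "B"}:
        return "AB"
    if "A" in pair:
        return "A"
    if "B" in pair:
        return "B"
    return "O"

def genotype_posterior(observed, prior):
    """p(G | x) proportional to p(x | G) p(G), enumerating all ordered genotypes."""
    unnorm = {}
    for g in product(ALLELES, repeat=2):
        likelihood = 1.0 if phenotype(*g) == observed else 0.0
        unnorm[g] = likelihood * prior[g]
    total = sum(unnorm.values())
    return {g: p / total for g, p in unnorm.items() if p > 0}

# Assumption: a uniform prior over the 9 ordered genotypes, standing in for p(G_i | x_{-i}).
uniform_prior = {g: 1 / 9 for g in product(ALLELES, repeat=2)}
print(genotype_posterior("A", uniform_prior))
```

With a uniform prior the three genotypes consistent with phenotype A share the posterior mass equally; a more informative prior derived from relatives would break this tie.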

In practice, the model is used to try to determine where along the genome a given
disease-causing gene is assumed to lie; this is the genetic linkage analysis task. The method works as
follows. First, suppose all the parameters of the model, including the distance between all the
marker loci, are known. The only unknown is the location of the disease-causing gene. If there
are L marker loci, we construct L + 1 models: in model $\ell$, we postulate that the disease gene
comes after marker $\ell$, for $0 < \ell < L + 1$. We can estimate the Markov switching parameter $\hat{\theta}_\ell$,
and hence the distance $d_\ell$ between the disease gene and its nearest known locus. We measure
the quality of that model using its likelihood, $p(\mathcal{D} | \hat{\theta}_\ell)$. We can then pick the model with
highest likelihood (which is equivalent to the MAP model under a uniform prior).

Note, however, that computing the likelihood requires marginalizing out all the hidden Z 
and G variables. See (Fishelson and Geiger 2002) and the references therein for some exact 
methods for this task; these are based on the variable elimination algorithm, which we discuss 
in Section 20.3. Unfortunately, for reasons we explain in Section 20.5, exact methods can be 
computationally intractable if the number of individuals and/or loci is large. See (Albers et al. 
2006) for an approximate method for computing the likelihood; this is based on a form of 
variational inference, which we will discuss in Section 22.4.1. 


10.2.5 Directed Gaussian graphical models *


Consider a DGM where all the variables are real-valued, and all the CPDs have the following 
form: 


$$p(x_t | \mathbf{x}_{pa(t)}) = \mathcal{N}(x_t | \mu_t + \mathbf{w}_t^T \mathbf{x}_{pa(t)}, \sigma_t^2) \qquad (10.14)$$


This is called a linear Gaussian CPD. As we show below, multiplying all these CPDs together
results in a large joint Gaussian distribution of the form $p(\mathbf{x}) = \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma})$. This is called a
directed GGM, or a Gaussian Bayes net.

We now explain how to derive $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ from the CPD parameters, following (Shachter and
Kenley 1989, App. B). For convenience, we will rewrite the CPDs in the following form:


$$x_t = \mu_t + \sum_{s \in pa(t)} w_{ts}(x_s - \mu_s) + \sigma_t z_t \qquad (10.15)$$


where $z_t \sim \mathcal{N}(0, 1)$, $\sigma_t$ is the conditional standard deviation of $x_t$ given its parents, $w_{ts}$ is the
strength of the $s \to t$ edge, and $\mu_t$ is the local mean.²

It is easy to see that the global mean is just the concatenation of the local means,
$\boldsymbol{\mu} = (\mu_1, \ldots, \mu_D)$. We now derive the global covariance, $\boldsymbol{\Sigma}$. Let $\mathbf{S} = \mathrm{diag}(\boldsymbol{\sigma})$ be a diagonal matrix
containing the standard deviations. We can rewrite Equation 10.15 in matrix-vector form as
follows:


$$(\mathbf{x} - \boldsymbol{\mu}) = \mathbf{W}(\mathbf{x} - \boldsymbol{\mu}) + \mathbf{S}\mathbf{z} \qquad (10.16)$$


2. If we do not subtract off the parent’s mean (i.e., if we use $x_t = \mu_t + \sum_{s \in pa(t)} w_{ts} x_s + \sigma_t z_t$), the derivation of $\boldsymbol{\Sigma}$
is much messier, as can be seen by looking at (Bishop 2006b, p370).
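
Now let $\mathbf{e}$ be a vector of noise terms:

$$\mathbf{e} \triangleq \mathbf{S}\mathbf{z} \qquad (10.17)$$

We can rearrange this to get

$$\mathbf{e} = (\mathbf{I} - \mathbf{W})(\mathbf{x} - \boldsymbol{\mu}) \qquad (10.18)$$


Since $\mathbf{W}$ is lower triangular (because $w_{ts} = 0$ if $s \ge t$ in the topological ordering), we have that
$\mathbf{I} - \mathbf{W}$ is lower triangular with 1s on the diagonal. Hence


$$\begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ \vdots \\ e_D \end{pmatrix} = \begin{pmatrix} 1 & & & & \\ -w_{21} & 1 & & & \\ -w_{31} & -w_{32} & 1 & & \\ \vdots & & & \ddots & \\ -w_{D1} & -w_{D2} & \cdots & -w_{D,D-1} & 1 \end{pmatrix} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \\ x_3 - \mu_3 \\ \vdots \\ x_D - \mu_D \end{pmatrix} \qquad (10.19)$$

Since $\mathbf{I} - \mathbf{W}$ is always invertible, we can write

$$\mathbf{x} - \boldsymbol{\mu} = (\mathbf{I} - \mathbf{W})^{-1} \mathbf{e} \triangleq \mathbf{U}\mathbf{e} = \mathbf{U}\mathbf{S}\mathbf{z} \qquad (10.20)$$


where we defined $\mathbf{U} = (\mathbf{I} - \mathbf{W})^{-1}$. Thus the regression weights correspond to a Cholesky
decomposition of $\boldsymbol{\Sigma}$, as we now show:


$$\boldsymbol{\Sigma} = \mathrm{cov}[\mathbf{x}] = \mathrm{cov}[\mathbf{x} - \boldsymbol{\mu}] \qquad (10.21)$$
$$= \mathrm{cov}[\mathbf{U}\mathbf{S}\mathbf{z}] = \mathbf{U}\mathbf{S}\,\mathrm{cov}[\mathbf{z}]\,\mathbf{S}\mathbf{U}^T = \mathbf{U}\mathbf{S}^2\mathbf{U}^T \qquad (10.22)$$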
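
The derivation can be checked numerically. The following sketch builds the covariance of a small Gaussian Bayes net from its edge weights and conditional standard deviations using $\boldsymbol{\Sigma} = \mathbf{U}\mathbf{S}^2\mathbf{U}^T$, and compares it with an empirical covariance obtained by ancestral sampling of Equation 10.15. The three-node chain, its weights, and the function name are hypothetical, and the local means are set to zero for simplicity.

```python
import numpy as np

def gaussian_bayes_net_cov(W, sigma):
    """Sigma = U S^2 U^T with U = (I - W)^{-1}, per Equations 10.20-10.22.

    W[t, s] is the weight of the s -> t edge (strictly lower triangular when
    nodes are in topological order); sigma[t] is the conditional std of x_t.
    """
    D = len(sigma)
    U = np.linalg.inv(np.eye(D) - W)
    S = np.diag(sigma)
    return U @ S @ S @ U.T

# Hypothetical chain x1 -> x2 -> x3 with made-up weights and noise levels.
W = np.array([[0.0, 0.0, 0.0],
              [2.0, 0.0, 0.0],
              [0.0, -1.0, 0.0]])
sigma = np.array([1.0, 0.5, 2.0])
Sigma = gaussian_bayes_net_cov(W, sigma)
print(Sigma)

# Monte Carlo check via ancestral sampling of Equation 10.15 (with mu = 0).
rng = np.random.default_rng(0)
z = rng.standard_normal((200_000, 3))
x = np.zeros_like(z)
for t in range(3):
    x[:, t] = x @ W[t] + sigma[t] * z[:, t]
print(np.cov(x, rowvar=False))
```

The two printed matrices should agree up to sampling noise. Because the nodes are visited in topological order, each `x[:, t]` only depends on columns that have already been filled in.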


10.3 Inference


We have seen that graphical models provide a compact way to define joint probability distribu- 
tions. Given such a joint distribution, what can we do with it? The main use for such a joint 
distribution is to perform probabilistic inference. This refers to the task of estimating unknown 
quantities from known quantities. For example, in Section 10.2.2, we introduced HMMs, and 
said that one of the goals is to estimate the hidden states (e.g., words) from the observations 
(e.g., speech signal). And in Section 10.2.4, we discussed genetic linkage analysis, and said that 
one of the goals is to estimate the likelihood of the data under various DAGs, corresponding to 
different hypotheses about the location of the disease-causing gene. 

In general, we can pose the inference problem as follows. Suppose we have a set of correlated
random variables with joint distribution $p(\mathbf{x}_{1:V} | \boldsymbol{\theta})$. (In this section, we are assuming the
parameters $\boldsymbol{\theta}$ of the model are known. We discuss how to learn the parameters in Section 10.4.)
Let us partition this vector into the visible variables $\mathbf{x}_v$, which are observed, and the hidden
variables, $\mathbf{x}_h$, which are unobserved. Inference refers to computing the posterior distribution
of the unknowns given the knowns:


$$p(\mathbf{x}_h | \mathbf{x}_v, \boldsymbol{\theta}) = \frac{p(\mathbf{x}_h, \mathbf{x}_v | \boldsymbol{\theta})}{p(\mathbf{x}_v | \boldsymbol{\theta})} = \frac{p(\mathbf{x}_h, \mathbf{x}_v | \boldsymbol{\theta})}{\sum_{\mathbf{x}_h'} p(\mathbf{x}_h', \mathbf{x}_v | \boldsymbol{\theta})} \qquad (10.23)$$

Essentially we are conditioning on the data by clamping the visible variables to their observed
values, $\mathbf{x}_v$, and then normalizing, to go from $p(\mathbf{x}_h, \mathbf{x}_v)$ to $p(\mathbf{x}_h | \mathbf{x}_v)$. The normalization constant
$p(\mathbf{x}_v | \boldsymbol{\theta})$ is the likelihood of the data, also called the probability of the evidence.


Sometimes only some of the hidden variables are of interest to us. So let us partition the
hidden variables into query variables, $\mathbf{x}_q$, whose value we wish to know, and the remaining
nuisance variables, $\mathbf{x}_n$, which we are not interested in. We can compute what we are interested
in by marginalizing out the nuisance variables:


$$p(\mathbf{x}_q | \mathbf{x}_v, \boldsymbol{\theta}) = \sum_{\mathbf{x}_n} p(\mathbf{x}_q, \mathbf{x}_n | \mathbf{x}_v, \boldsymbol{\theta}) \qquad (10.24)$$


In Section 4.3.1, we saw how to perform all these operations for a multivariate Gaussian in
$O(V^3)$ time, where V is the number of variables. What if we have discrete random variables,
with say K states each? If the joint distribution is represented as a multi-dimensional table,
we can always perform these operations exactly, but this will take $O(K^V)$ time. In Chapter 20,
we explain how to exploit the factorization encoded by the GM to perform these operations in
$O(V K^{w+1})$ time, where w is a quantity known as the treewidth of the graph. This measures
how “tree-like” the graph is. If the graph is a tree (or a chain), we have w = 1, so for these
models, inference takes time linear in the number of nodes. Unfortunately, for more general
graphs, exact inference can take time exponential in the number of nodes, as we explain in
Section 20.5. We will therefore examine various approximate inference schemes later in the
book.
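
The brute-force approach on a multi-dimensional table can be sketched in a few lines: clamp the visible variables, sum out the nuisance variables, and normalize. This is the $O(K^V)$ method, not the factorization-exploiting algorithms of Chapter 20. The function name, the argument layout, and the random joint table are assumptions made for this illustration.

```python
import numpy as np

def posterior_by_enumeration(joint, evidence, query):
    """Compute p(x_q | x_v) from a full joint table.

    joint    : array with one axis per variable, joint[x_0, ..., x_{V-1}]
    evidence : dict {axis: observed value} for the visible variables
    query    : list of axes to keep; all other hidden axes are summed out
    """
    # Clamp the visible variables to their observed values.
    index = tuple(evidence.get(ax, slice(None)) for ax in range(joint.ndim))
    clamped = joint[index]
    # Axes that survive clamping, in their original order.
    hidden = [ax for ax in range(joint.ndim) if ax not in evidence]
    # Marginalize out the nuisance variables.
    nuisance = tuple(i for i, ax in enumerate(hidden) if ax not in query)
    unnorm = clamped.sum(axis=nuisance)
    # Normalize by the probability of the evidence.
    return unnorm / unnorm.sum()

# Hypothetical joint over V = 3 binary variables (K = 2), stored as a K^V table.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()
print(posterior_by_enumeration(joint, evidence={2: 1}, query=[0]))
```

Here variable 2 is visible, variable 0 is the query, and variable 1 is a nuisance variable. The table has $K^V$ entries, which is why this approach only works for very small models.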


10.4 Learning


In the graphical models literature, it is common to distinguish between inference and learning.
Inference means computing (functions of) $p(\mathbf{x}_h | \mathbf{x}_v, \boldsymbol{\theta})$, where v are the visible nodes, h are the
hidden nodes, and $\boldsymbol{\theta}$ are the parameters of the model, assumed to be known. Learning usually
means computing a MAP estimate of the parameters given data:


$$\hat{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \sum_{i=1}^{N} \log p(\mathbf{x}_{i,v} | \boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) \qquad (10.25)$$

where $\mathbf{x}_{i,v}$ are the visible variables in case i. If we have a uniform prior, $p(\boldsymbol{\theta}) \propto 1$, this reduces
to the MLE, as usual.

If we adopt a Bayesian view, the parameters are unknown variables and should also be 
inferred. Thus to a Bayesian, there is no distinction between inference and learning. In fact, we 
can just add the parameters as nodes to the graph, condition on D, and then infer the values 
of all the nodes. (We discuss this in more detail below.) 

In this view, the main difference between hidden variables and parameters is that the number 
of hidden variables grows with the amount of training data (since there is usually a set of hidden 
variables for each observed data case), whereas the number of parameters is usually fixed (at
least in a parametric model). This means that we must integrate out the hidden variables to avoid 
overfitting, but we may be able to get away with point estimation techniques for parameters, 
which are fewer in number. 


10.4.1 Plate notation


When inferring parameters from data, we often assume the data is iid. We can represent this
assumption explicitly using a graphical model, as shown in Figure 10.7(a). This illustrates the
assumption that each data case was generated independently but from the same distribution.
Notice that the data cases are only independent conditional on the parameters $\boldsymbol{\theta}$; marginally,
the data cases are dependent. Nevertheless, we can see that, in this example, the order in which
the data cases arrive makes no difference to our beliefs about $\boldsymbol{\theta}$, since all orderings will have
the same sufficient statistics. Hence we say the data is exchangeable.


Figure 10.7 Left: data points $\mathbf{x}_i$ are conditionally independent given $\boldsymbol{\theta}$. Right: Plate notation. This
represents the same model as the one on the left, except the repeated $\mathbf{x}_i$ nodes are inside a box, known as
a plate; the number in the lower right hand corner, N, specifies the number of repetitions of the $\mathbf{x}_i$ node.

To avoid visual clutter, it is common to use a form of syntactic sugar called plates: we 
simply draw a little box around the repeated variables, with the convention that nodes within 
the box will get repeated when the model is unrolled. We often write the number of copies or 
repetitions in the bottom right corner of the box. See Figure 10.7(b) for a simple example. The 
corresponding joint distribution has the form 


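$$p(\boldsymbol{\theta}, \mathcal{D}) = p(\boldsymbol{\theta}) \left[ \prod_{i=1}^{N} p(\mathbf{x}_i | \boldsymbol{\theta}) \right] \qquad (10.26)$$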


This DGM represents the CI assumptions behind the models we considered in Chapter 5. 

A slightly more complex example is shown in Figure 10.8. On the left we show a naive Bayes 
classifier that has been “unrolled” for D features, but uses a plate to represent repetition over 
cases i = 1: N. The version on the right shows the same model using nested plate notation. 
When a variable is inside two plates, it will have two sub-indices. For example, we write $\theta_{jc}$
to represent the parameter for feature j in class-conditional density c. Note that plates can
be nested or crossing. Notational devices for modeling more complex parameter tying patterns
can be devised (e.g., (Heckerman et al. 2004)), but these are not widely used. What is not clear
from the figure is that $\theta_{jc}$ is used to generate $x_{ij}$ iff $y_i = c$, otherwise it is ignored. This is an
example of context specific independence, since the CI relationship $x_{ij} \perp \theta_{jc}$ only holds if
$y_i \neq c$.

Figure 10.8 Naive Bayes classifier as a DGM. (a) With single plates. (b) With nested plates.


10.4.2 Learning from complete data


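If all the variables are fully observed in each case, so there is no missing data and there are no
hidden variables, we say the data is complete. For a DGM with complete data, the likelihood is
given by

$$p(\mathcal{D} | \boldsymbol{\theta}) = \prod_{i=1}^{N} p(\mathbf{x}_i | \boldsymbol{\theta}) = \prod_{i=1}^{N} \prod_{t=1}^{V} p(x_{it} | \mathbf{x}_{i,pa(t)}, \boldsymbol{\theta}_t) = \prod_{t=1}^{V} p(\mathcal{D}_t | \boldsymbol{\theta}_t) \qquad (10.27)$$

where $\mathcal{D}_t$ is the data associated with node t and its parents, i.e., the tth family. This is a
product of terms, one per CPD. We say that the likelihood decomposes according to the graph
structure.

Now suppose that the prior factorizes as well:

$$p(\boldsymbol{\theta}) = \prod_{t=1}^{V} p(\boldsymbol{\theta}_t) \qquad (10.28)$$

Then clearly the posterior also factorizes:

$$p(\boldsymbol{\theta} | \mathcal{D}) \propto p(\mathcal{D} | \boldsymbol{\theta}) p(\boldsymbol{\theta}) = \prod_{t=1}^{V} p(\mathcal{D}_t | \boldsymbol{\theta}_t) p(\boldsymbol{\theta}_t) \qquad (10.29)$$


This means we can compute the posterior of each CPD independently. In other words,

factored prior plus factored likelihood implies factored posterior (10.30)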


Let us consider an example, where all CPDs are tabular, thus extending the earlier results of
Section 3.5.1.2, where we discussed Bayesian naive Bayes. We have a separate row (i.e., a separate
multinoulli distribution) for each conditioning case, i.e., for each combination of parent values,
as in Table 10.2. Formally, we can write the tth CPT as $x_t | \mathbf{x}_{pa(t)} = c \sim \mathrm{Cat}(\boldsymbol{\theta}_{tc})$, where
$\theta_{tck} \triangleq p(x_t = k | \mathbf{x}_{pa(t)} = c)$, for $k = 1 : K_t$, $c = 1 : C_t$ and $t = 1 : T$. Here $K_t$ is the number
of states for node t, $C_t \triangleq \prod_{s \in pa(t)} K_s$ is the number of parent combinations, and T is the
number of nodes. Obviously $\sum_k \theta_{tck} = 1$ for each row of each CPT.

Let us put a separate Dirichlet prior on each row of each CPT, i.e., $\boldsymbol{\theta}_{tc} \sim \mathrm{Dir}(\boldsymbol{\alpha}_{tc})$. Then we
can compute the posterior by simply adding the pseudo counts to the empirical counts to get
$\boldsymbol{\theta}_{tc} | \mathcal{D} \sim \mathrm{Dir}(\mathbf{N}_{tc} + \boldsymbol{\alpha}_{tc})$, where $N_{tck}$ is the number of times that node t is in state k while its
parents are in state c:


$$N_{tck} \triangleq \sum_{i=1}^{N} \mathbb{I}(x_{i,t} = k, \mathbf{x}_{i,pa(t)} = c) \qquad (10.31)$$


From Equation 2.77, the mean of this distribution is given by the following:

$$\bar{\theta}_{tck} = \frac{N_{tck} + \alpha_{tck}}{\sum_{k'} (N_{tck'} + \alpha_{tck'})} \qquad (10.32)$$

For example, consider the DGM in Figure 10.1(a). Suppose the training data consists of the
following 5 cases:


ooro © 
ll el le) 
SF Orr 
KH oOo.. OC 
oo or Oo 


Below we list all the sufficient statistics $N_{tck}$, and the posterior mean parameters $\bar{\theta}_{tck}$, under
a Dirichlet prior with $\alpha_{tck} = 1$ (corresponding to add-one smoothing) for the t = 4 node:


$x_2$  $x_3$ | $N_{tck=1}$  $N_{tck=0}$ | $\bar{\theta}_{tck=1}$  $\bar{\theta}_{tck=0}$
0      0     | 0            0           | 1/2                     1/2
1      0     | 1            0           | 2/3                     1/3
0      1     | 0            1           | 1/3                     2/3
1      1     | 2            1           | 3/5                     2/5


It is easy to show that the MLE has the same form as Equation 10.32, except without the $\alpha_{tck}$
terms, i.e.,


$$\hat{\theta}_{tck} = \frac{N_{tck}}{\sum_{k'} N_{tck'}} \qquad (10.33)$$


Of course, the MLE suffers from the zero-count problem discussed in Section 3.3.4.1, so it is 
important to use a prior to regularize the estimation problem. 
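
The posterior mean and the MLE for a single CPT can be computed directly from its count table. The sketch below reuses the sufficient statistics listed for the t = 4 node and applies Equations 10.32 and 10.33 row by row; the function names are hypothetical. It also shows the zero-count problem: a parent configuration that never occurs leaves the MLE undefined, while the smoothed estimate falls back to the prior mean.

```python
import numpy as np

def cpt_posterior_mean(counts, alpha=1.0):
    """Posterior mean of each CPT row under a Dir(alpha, ..., alpha) prior (Equation 10.32)."""
    smoothed = counts + alpha
    return smoothed / smoothed.sum(axis=1, keepdims=True)

def cpt_mle(counts):
    """MLE of each CPT row (Equation 10.33); rows with no data are left as NaN."""
    totals = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(totals > 0, counts / totals, np.nan)

# Counts for node t = 4, taken from the sufficient statistics listed in the text.
# Rows: (x2, x3) = (0,0), (1,0), (0,1), (1,1); columns: x4 = 1, x4 = 0.
counts = np.array([[0, 0],
                   [1, 0],
                   [0, 1],
                   [2, 1]], dtype=float)
print(cpt_posterior_mean(counts))
print(cpt_mle(counts))
```

Since the posterior factorizes over CPDs and over rows, the same two functions can be applied to each node's count table independently.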


10.4.3 Learning with missing and/or latent variables


If we have missing data and/or hidden variables, the likelihood no longer factorizes, and indeed
it is no longer convex, as we explain in detail in Section 11.3. This means we can usually
only compute a locally optimal ML or MAP estimate. Bayesian inference of the parameters is
even harder. We discuss suitable approximate inference techniques in later chapters.


10.5 Conditional independence properties of DGMs


At the heart of any graphical model is a set of conditional independence (CI) assumptions. We write
$\mathbf{x}_A \perp_G \mathbf{x}_B | \mathbf{x}_C$ if A is independent of B given C in the graph G, using the semantics to be
defined below. Let I(G) be the set of all such CI statements encoded by the graph.

We say that G is an I-map (independence map) for p, or that p is Markov wrt G, iff
$I(G) \subseteq I(p)$, where I(p) is the set of all CI statements that hold for distribution p. In other
words, the graph is an I-map if it does not make any assertions of CI that are not true of the
distribution. This allows us to use the graph as a safe proxy for p when reasoning about p’s CI
properties. This is helpful for designing algorithms that work for large classes of distributions,
regardless of their specific numerical parameters $\boldsymbol{\theta}$.

Note that the fully connected graph is an I-map of all distributions, since it makes no CI
assertions at all (since it is not missing any edges). We therefore say G is a minimal I-map of
p if G is an I-map of p, and if there is no $G' \subset G$ which is an I-map of p.

It remains to specify how to determine if $\mathbf{x}_A \perp_G \mathbf{x}_B | \mathbf{x}_C$. Deriving these independencies
for undirected graphs is easy (see Section 19.2), but the DAG situation is somewhat complicated,
because of the need to respect the orientation of the directed edges. We give the details below.


10.5.1 d-separation and the Bayes Ball algorithm (global Markov properties)


First, we introduce some definitions. We say an undirected path P is d-separated by a set of
nodes E (containing the evidence) iff at least one of the following conditions holds:


1. P contains a chain, $s \to m \to t$ or $s \leftarrow m \leftarrow t$, where $m \in E$
2. P contains a tent or fork, $s \leftarrow m \to t$, where $m \in E$
3. P contains a collider or v-structure, $s \to m \leftarrow t$, where m is not in E and nor is any
descendant of m.


Next, we say that a set of nodes A is d-separated from a different set of nodes B given a
third observed set E iff each undirected path from every node $a \in A$ to every node $b \in B$ is
d-separated by E. Finally, we define the CI properties of a DAG as follows:


$$\mathbf{x}_A \perp_G \mathbf{x}_B | \mathbf{x}_E \iff A \text{ is d-separated from } B \text{ given } E \qquad (10.34)$$
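
A d-separation query can be answered mechanically. The sketch below does not enumerate paths or bounce balls; instead it uses a standard equivalent test: restrict the DAG to A, B, E and their ancestors, moralize it (connect co-parents and drop edge directions), delete E, and check whether A can still reach B. The function name and the dictionary-of-parents representation are assumptions made for this example, and the three tiny graphs are the chain, fork and collider structures from the definition above, with an extra child `Yc` of the collider node.

```python
from itertools import combinations

def d_separated(parents, A, B, E):
    """Test whether A is d-separated from B given E in a DAG.

    parents maps each node to the set of its parents. Uses the moralized
    ancestral graph criterion instead of enumerating paths.
    """
    A, B, E = set(A), set(B), set(E)
    # 1. Keep only A, B, E and their ancestors.
    keep, stack = set(), list(A | B | E)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))
    # 2. Moralize: link each node to its parents, and "marry" co-parents.
    nbrs = {n: set() for n in keep}
    for child in keep:
        pas = list(parents.get(child, ()))
        for p in pas:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for p, q in combinations(pas, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    # 3. Remove E and check whether any node of B is reachable from A.
    seen, stack = set(), [a for a in A if a not in E]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(n for n in nbrs[node] if n not in E and n not in seen)
    return not (seen & B)

chain = {"Y": {"X"}, "Z": {"Y"}}            # X -> Y -> Z
fork = {"X": {"Y"}, "Z": {"Y"}}             # X <- Y -> Z
collider = {"Y": {"X", "Z"}, "Yc": {"Y"}}   # X -> Y <- Z, Y -> Yc
print(d_separated(chain, {"X"}, {"Z"}, {"Y"}))
print(d_separated(fork, {"X"}, {"Z"}, set()))
print(d_separated(collider, {"X"}, {"Z"}, set()))
print(d_separated(collider, {"X"}, {"Z"}, {"Yc"}))
```

The last two calls illustrate condition 3: the collider blocks the path when neither it nor any descendant is observed, but observing the descendant `Yc` unblocks it.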


The Bayes ball algorithm (Shachter 1998) is a simple way to see if A is d-separated from B 
given E, based on the above definition. The idea is this. We “shade” all nodes in E, indicating
that they are observed. We then place “balls” at each node in A, let them “bounce around” 
according to some rules, and then ask if any of the balls reach any of the nodes in B. The three 
main rules are shown in Figure 10.9. Notice that balls can travel opposite to edge directions. 
We see that a ball can pass through a chain, but not if it is shaded in the middle. Similarly, a 
ball can pass through a fork, but not if it is shaded in the middle. However, a ball cannot pass 
through a v-structure, unless it is shaded in the middle. 

We can justify the 3 rules of Bayes ball as follows. First consider a chain structure
$X \to Y \to Z$, which encodes


$$p(x, y, z) = p(x) p(y | x) p(z | y) \qquad (10.35)$$

Figure 10.9 Bayes ball rules. A shaded node is one we condition on. If there is an arrow hitting a bar, it
means the ball cannot pass through; otherwise the ball can pass through. Based on (Jordan 2007).


When we condition on y, are x and z independent? We have


$$p(x, z | y) = \frac{p(x) p(y | x) p(z | y)}{p(y)} = \frac{p(x, y) p(z | y)}{p(y)} = p(x | y) p(z | y) \qquad (10.36)$$


and therefore $x \perp z | y$. So observing the middle node of a chain breaks it in two (as in a Markov
chain).
Now consider the tent structure $X \leftarrow Y \to Z$. The joint is


$$p(x, y, z) = p(y) p(x | y) p(z | y) \qquad (10.37)$$


Figure 10.10 (a-b) Bayes ball boundary conditions. (c) Example of why we need boundary conditions. y’ 
is an observed child of y, rendering y “effectively observed”, so the ball bounces back up on its way from 
x to z. 


When we condition on y, are x and z independent? We have


$$p(x, z | y) = \frac{p(x, y, z)}{p(y)} = \frac{p(y) p(x | y) p(z | y)}{p(y)} = p(x | y) p(z | y) \qquad (10.38)$$


and therefore $x \perp z | y$. So observing a root node separates its children (as in a naive Bayes
classifier: see Section 3.5).
Finally consider a v-structure $X \to Y \leftarrow Z$. The joint is


$$p(x, y, z) = p(x) p(z) p(y | x, z) \qquad (10.39)$$

When we condition on y, are x and z independent? We have


$$p(x, z | y) = \frac{p(x) p(z) p(y | x, z)}{p(y)} \qquad (10.40)$$


so $x \not\perp z | y$. However, in the unconditional distribution, we have


$$p(x, z) = p(x) p(z) \qquad (10.41)$$


so we see that x and z are marginally independent. Thus conditioning on a common
child at the bottom of a v-structure makes its parents become dependent. This important effect
is called explaining away, inter-causal reasoning, or Berkson’s paradox. As an example of
explaining away, suppose we toss two coins, representing the binary numbers 0 and 1, and we
observe the “sum” of their values. A priori, the coins are independent, but once we observe
their sum, they become coupled (e.g., if the sum is 1, and the first coin is 0, then we know the
second coin is 1).
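
The two-coin example is small enough to verify by exhaustive enumeration. The following sketch assumes both coins are fair (the text does not specify this) and uses exact fractions, so the comparisons involve no rounding; the helper name `prob` is hypothetical.

```python
from itertools import product
from fractions import Fraction

# Two fair coins x, z in {0, 1}; y = x + z is their observed "sum".
joint = {}
for x, z in product([0, 1], repeat=2):
    joint[(x, z, x + z)] = Fraction(1, 4)

def prob(pred):
    """Total probability of outcomes (x, z, y) satisfying pred."""
    return sum(p for outcome, p in joint.items() if pred(*outcome))

# Marginally, the coins are independent: p(x=0, z=1) = p(x=0) p(z=1).
p_joint = prob(lambda x, z, y: x == 0 and z == 1)
p_prod = prob(lambda x, z, y: x == 0) * prob(lambda x, z, y: z == 1)
print(p_joint == p_prod)

# Conditioned on y = 1, they are coupled: p(z=1 | x=0, y=1) != p(z=1 | y=1).
p_y1 = prob(lambda x, z, y: y == 1)
p_z1_given_y1 = prob(lambda x, z, y: z == 1 and y == 1) / p_y1
p_z1_given_x0_y1 = (prob(lambda x, z, y: x == 0 and z == 1 and y == 1)
                    / prob(lambda x, z, y: x == 0 and y == 1))
print(p_z1_given_y1, p_z1_given_x0_y1)
```

Once the sum is known, learning the first coin changes our belief about the second, even though the coins were tossed independently.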

Finally, Bayes Ball also needs the “boundary conditions” shown in Figure 10.10(a-b). To
understand where these rules come from, consider Figure 10.10(c). Suppose $Y'$ is a noise-free
copy of Y. Then if we observe $Y'$, we effectively observe Y as well, so the parents X and Z
have to compete to explain this. So if we send a ball down $X \to Y \to Y'$, it should “bounce
back” up from $Y'$ through Y to Z. However, if Y and all its children are hidden, the ball does not
bounce back.




Figure 10.11 A DGM. 


For example, in Figure 10.11, we see that $x_2 \perp x_6 | x_5$, since the 2-5-6 path is blocked
by $x_5$ (which is observed), the 2-4-7-6 path is blocked by $x_7$ (which is hidden), and
the 2-1-3-6 path is blocked by $x_1$ (which is hidden). However, we also see that
$x_2 \not\perp x_6 | x_5, x_7$, since now the 2-4-7-6 path is no longer blocked by $x_7$ (which is
observed). Exercise 10.2 gives you some more practice in determining CI relationships for DGMs.


10.5.2 Other Markov properties of DGMs


From the d-separation criterion, one can conclude that

$$t \perp \mathrm{nd}(t) \setminus \mathrm{pa}(t) \,|\, \mathrm{pa}(t) \qquad (10.42)$$


where the non-descendants of a node nd(t) are all the nodes except for its descendants,
$\mathrm{nd}(t) = \mathcal{V} \setminus \{t \cup \mathrm{desc}(t)\}$. Equation 10.42 is called the directed local Markov property. For
example, in Figure 10.11, we have nd(3) = {1, 2, 4}, and pa(3) = 1, so $3 \perp 2, 4 | 1$.

A special case of this property is when we only look at predecessors of a node according to
some topological ordering. We have


$$t \perp \mathrm{pred}(t) \setminus \mathrm{pa}(t) \,|\, \mathrm{pa}(t) \qquad (10.43)$$


which follows since $\mathrm{pred}(t) \subseteq \mathrm{nd}(t)$. This is called the ordered Markov property, which
justifies Equation 10.7. For example, in Figure 10.11, if we use the ordering 1, 2, ..., 7, we find
pred(3) = {1, 2} and pa(3) = 1, so $3 \perp 2 | 1$.

We have now described three Markov properties for DAGs: the directed global Markov property
G in Equation 10.34, the ordered Markov property O in Equation 10.43, and the directed local
Markov property L in Equation 10.42. It is obvious that $G \implies L \implies O$. What is less
obvious, but nevertheless true, is that $O \implies L \implies G$ (see e.g., (Koller and Friedman 2009)
for the proof). Hence all these properties are equivalent.

Furthermore, any distribution p that is Markov wrt G can be factorized as in Equation 10.7;
this is called the factorization property F. It is obvious that $O \implies F$, but one can show that
the converse also holds (see e.g., (Koller and Friedman 2009) for the proof).


Markov blanket and full conditionals 


The set of nodes that renders a node t conditionally independent of all the other nodes in
the graph is called t's Markov blanket; we will denote this by mb(t). One can show that the
Markov blanket of a node in a DGM is equal to the parents, the children, and the co-parents,
i.e., other nodes that are also parents of its children:

$$\mathrm{mb}(t) = \mathrm{ch}(t) \cup \mathrm{pa}(t) \cup \mathrm{copa}(t) \qquad (10.44)$$

For example, in Figure 10.11, we have

$$\mathrm{mb}(5) = \{6, 7\} \cup \{2, 3\} \cup \{4\} = \{2, 3, 4, 6, 7\} \qquad (10.45)$$

where 4 is a co-parent of 5 because they share a common child, namely 7.

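To see why the co-parents are in the Markov blanket, note that when we derive
$p(x_t \mid \mathbf{x}_{-t}) = p(x_t, \mathbf{x}_{-t}) / p(\mathbf{x}_{-t})$, all the terms that do not involve $x_t$ will cancel out
between numerator and denominator, so we are left with a product of CPDs which contain $x_t$
in their scope. Hence

$$p(x_t \mid \mathbf{x}_{-t}) \propto p(x_t \mid \mathbf{x}_{\mathrm{pa}(t)}) \prod_{s \in \mathrm{ch}(t)} p(x_s \mid \mathbf{x}_{\mathrm{pa}(s)}) \qquad (10.46)$$

For example, in Figure 10.11 we have

$$p(x_5 \mid \mathbf{x}_{-5}) \propto p(x_5 \mid x_2, x_3)\, p(x_6 \mid x_3, x_5)\, p(x_7 \mid x_4, x_5, x_6) \qquad (10.47)$$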


The resulting expression is called t's full conditional, and will prove to be important when we 
study Gibbs sampling (Section 24.2). 
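
The Markov blanket of Equation 10.44 can be read off mechanically from the parent sets of a DAG. The following is a minimal Python sketch, not code from the book. The parents of nodes 3, 5, 6 and 7 are taken from the text (pa(3) = {1} and the CPDs in Equation 10.47); the parents assigned to nodes 2 and 4 are assumptions made only so that the graph is complete, and the helper names `children` and `markov_blanket` are hypothetical.

```python
# Parent sets for a DAG in the style of Figure 10.11.
# pa(3), pa(5), pa(6), pa(7) follow from the text; pa(2) and pa(4) are assumed.
parents = {
    1: set(),
    2: {1},        # assumption
    3: {1},
    4: {2},        # assumption
    5: {2, 3},
    6: {3, 5},
    7: {4, 5, 6},
}


def children(t, parents):
    """Nodes that list t among their parents."""
    return {s for s, pa in parents.items() if t in pa}


def markov_blanket(t, parents):
    """mb(t) = ch(t) | pa(t) | copa(t), as in Equation 10.44."""
    ch = children(t, parents)
    # Co-parents: every parent of a child of t, excluding t itself.
    copa = set().union(*(parents[s] for s in ch)) - {t}
    return parents[t] | ch | copa


print(sorted(markov_blanket(5, parents)))  # [2, 3, 4, 6, 7]
```

The CPDs that survive in the full conditional of node t are exactly p(x_t | pa(t)) together with one CPD per child, which is why only the parent sets of t and of its children are consulted.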


Influence (decision) diagrams * 


We can represent multi-stage (Bayesian) decision problems by using a graphical notation known 
as a decision diagram or an influence diagram (Howard and Matheson 1981; Kjaerulff and 
Madsen 2008). This extends directed graphical models by adding decision nodes (also called action
nodes), represented by rectangles, and utility nodes (also called value nodes), represented
by diamonds. The original random variables are called chance nodes, and are represented by 
ovals, as usual. 

Figure 10.12(a) gives a simple example, illustrating the famous oil wild-catter problem.³ In
this problem, you have to decide whether to drill an oil well or not. You have two possible 
actions: d = 1 means drill, d = 0 means don’t drill. You assume there are 3 states of nature: 
o = 0 means the well is dry, o = 1 means it is wet (has some oil), and o = 2 means it is 
soaking (has a lot of oil). Suppose your prior beliefs are p(o) = [0.5, 0.3, 0.2]. Finally, you must 
specify the utility function U (d,o). Since the states and actions are discrete, we can represent 
it as a table (analogous to a CPT in a DGM). Suppose we use the following numbers, in dollars: 

        o=0    o=1    o=2
d=0       0      0      0
d=1     -70     50    200

We see that if you don't drill, you incur no costs, but also make no money. If you drill a dry 
well, you lose $70; if you drill a wet well, you gain $50; and if you drill a soaking well, you gain 
$200. Your prior expected utility if you drill is given by 


$$\mathrm{EU}(d = 1) = \sum_{o=0}^{2} p(o)\, U(d, o) = 0.5 \cdot (-70) + 0.3 \cdot 50 + 0.2 \cdot 200 = 20 \qquad (10.48)$$


3. This example is originally from (Raiffa 1968). Our presentation is based on some notes by Daphne Koller.

Figure 10.12 (a) Influence diagram for basic oil wild catter problem. (b) An extension in which we have
an information arc from the Sound chance node to the Drill decision node. (c) An extension in which we
get to decide whether to perform the test or not.


Your expected utility if you don't drill is 0. So your maximum expected utility is 

MEU = max{EU(d = 0), EU(d = 1)} = max{0, 20} = 20 (10.49) 
and therefore the optimal action is to drill: 

d* = argmax{EU(d = 0), EU (d = 1)} = 1 (10.50) 
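
The calculation in Equations 10.48 to 10.50 is a matrix-vector product followed by a max. A minimal NumPy sketch, using the prior and utility table given above (the array names `p_o`, `U`, `eu` are illustrative, not from the book):

```python
import numpy as np

# Prior over the state of nature o = 0 (dry), 1 (wet), 2 (soaking).
p_o = np.array([0.5, 0.3, 0.2])

# Utility table U[d, o]: row d = 0 is "don't drill", row d = 1 is "drill".
U = np.array([[0.0, 0.0, 0.0],
              [-70.0, 50.0, 200.0]])

eu = U @ p_o                # expected utility of each action, Equation 10.48
meu = float(eu.max())       # maximum expected utility, Equation 10.49
d_star = int(eu.argmax())   # optimal action, Equation 10.50

print(eu, meu, d_star)      # EU is about [0, 20], so the optimal action is d = 1
```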


Now let us consider a slight extension to the model. Suppose you perform a sounding to 
estimate the state of the well. The sounding observation can be in one of 3 states: s = 0 is 
a diffuse reflection pattern, suggesting no oil; s = 1 is an open reflection pattern, suggesting 
some oil; and s = 2 is a closed reflection pattern, indicating lots of oil. Since S is caused by O, 
we add an O -> S arc to our model. In addition, we assume that the outcome of the sounding 
test will be available before we decide whether to drill or not; hence we add an information 
arc from S to D. This is illustrated in Figure 10.12(b). 

Let us model the reliability of our sensor using the following conditional distribution for
p(s|o):

        s=0    s=1    s=2
o=0     0.6    0.3    0.1
o=1     0.3    0.4    0.3
o=2     0.1    0.4    0.5


Suppose we do the sounding test and we observe s = 0. The posterior over the oil state is

$$p(o \mid s = 0) = [0.732, 0.219, 0.049] \qquad (10.51)$$

Now your posterior expected utility of performing action d is

$$\mathrm{EU}(d \mid s = 0) = \sum_{o=0}^{2} p(o \mid s = 0)\, U(d, o) \qquad (10.52)$$

If d = 1, this gives

$$\mathrm{EU}(d = 1 \mid s = 0) = 0.732 \times (-70) + 0.219 \times 50 + 0.049 \times 200 = -30.5 \qquad (10.53)$$


However, if d = 0, then EU(d = 0|s = 0) = 0, since not drilling incurs no cost. So if we
observe s = 0, we are better off not drilling, which makes sense.

Now suppose we do the sounding test and we observe s = 1. By similar reasoning, one
can show that EU(d = 1|s = 1) = 32.9, which is higher than EU(d = 0|s = 1) = 0.
Similarly, if we observe s = 2, we have EU(d = 1|s = 2) = 87.5, which is much higher
than EU(d = 0|s = 2) = 0. Hence the optimal policy d*(s) is as follows: if s = 0, choose
d*(0) = 0 and get $0; if s = 1, choose d*(1) = 1 and get $32.9; and if s = 2, choose d*(2) = 1
and get $87.5.

You can compute your expected profit or maximum expected utility as follows:

$$\mathrm{MEU} = \sum_{s} p(s)\, \mathrm{EU}(d^*(s) \mid s) \qquad (10.54)$$

This is the expected utility given possible outcomes of the sounding test, assuming you act
optimally given the outcome. The prior marginal on the outcome of the test is

$$p(s) = \sum_{o} p(o)\, p(s \mid o) = [0.41, 0.35, 0.24] \qquad (10.55)$$


Hence your maximum expected utility is

$$\mathrm{MEU} = 0.41 \times 0 + 0.35 \times 32.9 + 0.24 \times 87.5 = 32.5 \qquad (10.56)$$

Now suppose you can choose whether to do the test or not. This can be modelled as shown
in Figure 10.12(c), where we add a new test node T. If T = 1, we do the test, and S can enter 1
of 3 states, determined by O, exactly as above. If T = 0, we don't do the test, and S enters a
special unknown state. There is also some cost associated with performing the test.

Is it worth doing the test? This depends on how much our MEU changes if we know the
outcome of the test (namely the state of S). If you don't do the test, you have MEU = 20
from Equation 10.49. If you do the test, you have MEU = 32.5 from Equation 10.56. So the
improvement in utility if you do the test (and act optimally on its outcome) is $12.5. This is
called the value of perfect information (VPI). So we should do the test as long as it costs less
than $12.5.

Figure 10.13 (a) A POMDP, shown as an influence diagram. z_t are hidden world states. We implicitly
make the no forgetting assumption, which effectively means that a_t has arrows coming into it from all
previous observations, x_{1:t}. (b) An MDP, shown as an influence diagram.

In terms of graphical models, the VPI of a variable T can be determined by computing the
MEU for the base influence diagram, I, and then computing the MEU for the same influence
diagram where we add information arcs from T to the action nodes, and then computing the
difference. In other words,

$$\mathrm{VPI} = \mathrm{MEU}(I + T \rightarrow D) - \mathrm{MEU}(I) \qquad (10.57)$$


where D is the decision node and T is the variable we are measuring. 
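
The whole sounding-test analysis (posterior, conditional expected utilities, optimal policy, MEU and VPI) can be reproduced in a few lines. This is a minimal NumPy sketch using the prior, sensor model and utility table given above; the variable names are illustrative. Evaluating the sums without intermediate rounding gives an MEU of about 32.5 with the test and therefore a VPI of about 12.5.

```python
import numpy as np

p_o = np.array([0.5, 0.3, 0.2])                  # prior p(o)
p_s_given_o = np.array([[0.6, 0.3, 0.1],         # rows: o = 0, 1, 2
                        [0.3, 0.4, 0.3],         # cols: s = 0, 1, 2
                        [0.1, 0.4, 0.5]])
U = np.array([[0.0, 0.0, 0.0],                   # U[d, o], d = 0: don't drill
              [-70.0, 50.0, 200.0]])             # d = 1: drill

joint = p_o[:, None] * p_s_given_o               # joint[o, s] = p(o) p(s|o)
p_s = joint.sum(axis=0)                          # marginal p(s), Equation 10.55
posterior = joint / p_s                          # posterior[o, s] = p(o|s)

eu = U @ posterior                               # eu[d, s] = EU(d | s)
policy = eu.argmax(axis=0)                       # optimal action d*(s) for each s
meu_with_test = float(p_s @ eu.max(axis=0))      # Equation 10.54
meu_without_test = float((U @ p_o).max())        # Equation 10.49
vpi = meu_with_test - meu_without_test           # Equation 10.57

print(np.round(p_s, 2), np.round(eu[1], 1), policy)
print(round(meu_with_test, 1), round(vpi, 1))
```

Storing the joint as a matrix indexed by (o, s) means that conditioning on each possible sounding outcome is a single column normalization, so the policy for all three outcomes is obtained at once.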

It is possible to modify the variable elimination algorithm (Section 20.3) so that it computes 
the optimal policy given an influence diagram. These methods essentially work backwards from 
the final time-step, computing the optimal decision at each step assuming all following actions 
are chosen optimally. See e.g., (Lauritzen and Nilsson 2001; Kjaerulff and Madsen 2008) for 
details. 

We could continue to extend the model in various ways. For example, we could imagine a 
dynamical system in which we test, observe outcomes, perform actions, move on to the next 
oil well, and continue drilling (and polluting) in this way. In fact, many problems in robotics, 
business, medicine, public policy, etc. can be usefully formulated as influence diagrams unrolled 
over time (Raiffa 1968; Lauritzen and Nilsson 2001; Kjaerulff and Madsen 2008). 

A generic model of this form is shown in Figure 10.13(a). This is known as a partially 
observed Markov decision process or POMDP (pronounced “pom-d-p”). This is basically a 
hidden Markov model (Section 17.3) augmented with action and reward nodes. This can be used 
to model the perception-action cycle that all intelligent agents use (see e.g., (Kaelbling et al. 
1998) for details). 

A special case of a POMDP, in which the states are fully observed, is called a Markov decision 
process or MDP, shown in Figure 10.13(b). This is much easier to solve, since we only have 
to compute a mapping from observed states to actions. This can be solved using dynamic 
programming (see e.g., (Sutton and Barto 1998) for details). 

In the POMDP case, the information arc from $x_t$ to $a_t$ is not sufficient to uniquely determine
the best action, since the state is not fully observed. Instead, we need to choose actions based
on our belief state, $p(z_t \mid x_{1:t}, a_{1:t})$. Since the belief updating process is deterministic (see
Section 17.4.2), we can compute a belief state MDP. For details on how to compute the policies for
such models, see e.g., (Kaelbling et al. 1998; Spaan and Vlassis 2005).

Figure 10.14 Some DGMs.


Exercises 


Exercise 10.1 Marginalizing a node in a DGM 

(Source: Koller.) 

Consider the DAG G in Figure 10.14(a). Assume it is a minimal I-map for p(A, B, C, D, E, F, X). Now
consider marginalizing out X. Construct a new DAG G' which is a minimal I-map for p(A, B, C, D, E, F).
Specify (and justify) which extra edges need to be added.

Exercise 10.2 Bayes Ball 

(Source: Jordan.) 


Here we compute some global independence statements from some directed graphical models. You can 
use the “Bayes ball” algorithm, the d-separation criterion, or the method of converting to an undirected 
graph (all should give the same results). 


a. Consider the DAG in Figure 10.14(b). List all variables that are independent of A given evidence on B. 
b. Consider the DAG in Figure 10.14(c). List all variables that are independent of A given evidence on J. 


Exercise 10.3 Markov blanket for a DGM 
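Prove that the full conditional for node i in a DGM is given by

$$p(X_i \mid X_{-i}) \propto p(X_i \mid \mathrm{Pa}(X_i)) \prod_{Y_j \in \mathrm{ch}(X_i)} p(Y_j \mid \mathrm{Pa}(Y_j)) \qquad (10.58)$$

where $\mathrm{ch}(X_i)$ are the children of $X_i$ and $\mathrm{Pa}(Y_j)$ are the parents of $Y_j$.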


Exercise 10.4 Hidden variables in DGMs 


Consider the DGMs in Figure 11.1 which both define $p(X_{1:6})$, where we number empty nodes left to right,
top to bottom. The graph on the left defines the joint as

$$p(X_{1:6}) = \sum_h p(X_1)\, p(X_2)\, p(X_3)\, p(H = h \mid X_{1:3})\, p(X_4 \mid H = h)\, p(X_5 \mid H = h)\, p(X_6 \mid H = h) \qquad (10.59)$$

where we have marginalized over the hidden variable H. The graph on the right defines the joint as

$$p(X_{1:6}) = p(X_1)\, p(X_2)\, p(X_3)\, p(X_4 \mid X_{1:3})\, p(X_5 \mid X_{1:4})\, p(X_6 \mid X_{1:5}) \qquad (10.60)$$

Figure 10.15 (a) Weather BN. (b) Fishing BN.


a. (5 points) Assuming all nodes (including H) are binary and all CPDs are tabular, prove that the model 
on the left has 17 free parameters. 


b. (5 points) Assuming all nodes are binary and all CPDs are tabular, prove that the model on the right 
has 59 free parameters. 


c. (5 points) Suppose we have a data set $\mathcal{D} = \{X^n_{1:6}\}$ for n = 1 : N, where we observe the Xs but not H,
and we want to estimate the parameters of the CPDs using maximum likelihood. For which model is 
this easier? Explain your answer. 


Exercise 10.5 Bayes nets for a rainy day 


(Source: Nando de Freitas.) In this question you must model a problem with 4 binary variables: G = "gray",
V = "Vancouver", R = "rain" and S = "sad". Consider the directed graphical model describing the relationship
between these variables shown in Figure 10.15(a).


a. Write down an expression for P(S = 1|V = 1) in terms of α, β, γ, δ.


b. Write down an expression for P(S = 1|V = 0). Is this the same as or different from P(S = 1|V = 1)?
Explain why.


c. Find maximum likelihood estimates of α, β, γ using the following data set, where each row is a training
case. (You may state your answers without proof.)


(10.61)

Exercise 10.6 Fishing nets
(Source: (Duda et al. 2001).) Consider the Bayes net shown in Figure 10.15(b). Here, the nodes represent
the following variables

$$X_1 \in \{\text{winter}, \text{spring}, \text{summer}, \text{autumn}\}, \quad X_2 \in \{\text{salmon}, \text{sea bass}\} \qquad (10.62)$$
$$X_3 \in \{\text{light}, \text{medium}, \text{dark}\}, \quad X_4 \in \{\text{wide}, \text{thin}\} \qquad (10.63)$$

Figure 10.16 (a) A QMR-style network with some hidden leaves. (b) Removing the barren nodes.


The corresponding conditional probability tables are

p(x1) = ( .25 .25 .25 .25 ),   p(x2|x1) = [4 × 2 matrix, entries illegible in the source]   (10.64)

p(x3|x2) = ( .33 .33 .34 ; [second row illegible in the source] ),   p(x4|x2) = ( .4 .6 ; .95 .05 )   (10.65)

Note that in p(x4|x2), the rows represent x2 and the columns x4 (so each row sums to one and represents
the child of the CPD). Thus p(x4 = thin|x2 = sea bass) = 0.05, p(x4 = thin|x2 = salmon) = 0.6, etc.


Answer the following queries. You may use matlab or do it by hand. In either case, show your work. 


a. Suppose the fish was caught on December 20 (the end of autumn and the beginning of winter),
and thus let p(x1) = (.5, 0, 0, .5) instead of the above prior. (This is called soft evidence, since we
do not know the exact value of X1, but we have a distribution over it.) Suppose the lightness has not
been measured but it is known that the fish is thin. Classify the fish as salmon or sea bass.


b. Suppose all we know is that the fish is thin and medium lightness. What season is it now, most likely?
Use p(x1) = ( .25 .25 .25 .25 ).


Exercise 10.7 Removing leaves in BN20 networks 


a. Consider the QMR network, where only some of the symptoms are observed. For example, in Figure
10.16(a), X4 and X5 are hidden. Show that we can safely remove all the hidden leaf nodes without
affecting the posterior over the disease nodes, i.e., prove that we can compute $p(z_{1:3} \mid x_1, x_2, x_4)$ using
the network in Figure 10.16(b). This is called barren node removal, and can be applied to any DGM.


b. Now suppose we partition the leaves into three groups: on, off and unknown. Clearly we can remove the 
unknown leaves, since they are hidden and do not affect their parents. Show that we can analytically 
remove the leaves that are in the “off state”, by absorbing their effect into the prior of the parents. 
(This trick only works for noisy-OR CPDs.) 


Exercise 10.8 Handling negative findings in the QMR network 


Consider the QMR network. Let d be the hidden diseases, $f^-$ be the negative findings (leaf nodes that are
off), and $f^+$ be the positive findings (leaf nodes that are on). We can compute the posterior $p(d \mid f^-, f^+)$ in
two steps: first absorb the negative findings, $p(d \mid f^-) \propto p(d)\, p(f^- \mid d)$, then absorb the positive findings,
$p(d \mid f^-, f^+) \propto p(d \mid f^-)\, p(f^+ \mid d)$. Show that the first step can be done in $O(|d||f^-|)$ time, where $|d|$ is
the number of diseases and $|f^-|$ is the number of negative findings. For simplicity, you can ignore leak
nodes. (Intuitively, the reason for this is that there is no correlation induced amongst the parents when
the finding is off, since there is no explaining away.)

Exercise 10.9 Moralization does not introduce new independence statements


Recall that the process of moralizing a DAG means connecting together all “unmarried” parents that share 
a common child, and then dropping all the arrows. Let M be the moralization of DAG G. Show that 
$\mathrm{CI}(M) \subseteq \mathrm{CI}(G)$, where CI is the set of conditional independence statements implied by the model.



Mixture models and the EM algorithm 


Latent variable models 


In Chapter 10 we showed how graphical models can be used to define high-dimensional joint 
probability distributions. The basic idea is to model dependence between two variables by 
adding an edge between them in the graph. (Technically the graph represents conditional 
independence, but you get the point.) 

An alternative approach is to assume that the observed variables are correlated because they
arise from a hidden common “cause”. Models with hidden variables are also known as latent
variable models or LVMs. As we will see in this chapter, such models are harder to fit than
models with no latent variables. However, they can have significant advantages, for two main
reasons. First, LVMs often have fewer parameters than models that directly represent correlation
in the visible space. This is illustrated in Figure 11.1. If all nodes (including H) are binary and all
CPDs are tabular, the model on the left has 17 free parameters, whereas the model on the right
has 59 free parameters.
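
The two parameter counts can be checked by summing, over nodes, the number of free entries in each tabular CPD: a node with K states has K − 1 free parameters for every joint configuration of its parents. The following minimal Python sketch does this for the two factorizations in Equations 10.59 and 10.60; the function name `num_free_params` and the dictionary encoding of the graphs are illustrative choices, not from the book.

```python
def num_free_params(parents, card):
    """Count free parameters of tabular CPDs in a DGM.

    parents: dict mapping each node to a list of its parents.
    card:    dict mapping each node to its number of states.
    """
    total = 0
    for node, pa in parents.items():
        n_parent_configs = 1
        for p in pa:
            n_parent_configs *= card[p]
        # Each row of the CPT sums to one, so it has card[node] - 1 free entries.
        total += (card[node] - 1) * n_parent_configs
    return total


# Left model of Figure 11.1 (Equation 10.59): roots -> H -> leaves.
left = {"X1": [], "X2": [], "X3": [],
        "H": ["X1", "X2", "X3"],
        "X4": ["H"], "X5": ["H"], "X6": ["H"]}

# Right model of Figure 11.1 (Equation 10.60): no hidden variable.
right = {"X1": [], "X2": [], "X3": [],
         "X4": ["X1", "X2", "X3"],
         "X5": ["X1", "X2", "X3", "X4"],
         "X6": ["X1", "X2", "X3", "X4", "X5"]}

binary = {name: 2 for name in ["X1", "X2", "X3", "X4", "X5", "X6", "H"]}

print(num_free_params(left, binary), num_free_params(right, binary))  # 17 59
```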

Second, the hidden variables in an LVM can serve as a bottleneck, which computes a
compressed representation of the data. This forms the basis of unsupervised learning, as we
will see. Figure 11.2 illustrates some generic LVM structures that can be used for this purpose.
In general there are L latent variables, $z_{i1}, \ldots, z_{iL}$, and D visible variables, $x_{i1}, \ldots, x_{iD}$,
where usually $D \gg L$. If we have L > 1, there are many latent factors contributing to each
observation, so we have a many-to-many mapping. If L = 1, we only have a single latent
variable; in this case, $z_i$ is usually discrete, and we have a one-to-many mapping. We can
also have a many-to-one mapping, representing different competing factors or causes for each
observed variable; such models form the basis of probabilistic matrix factorization, discussed
in Section 27.6.2. Finally, we can have a one-to-one mapping, which can be represented as
$\mathbf{z}_i \rightarrow \mathbf{x}_i$. By allowing $\mathbf{z}_i$ and/or $\mathbf{x}_i$ to be vector-valued, this representation can subsume all the
others. Depending on the form of the likelihood $p(\mathbf{x}_i \mid \mathbf{z}_i)$ and the prior $p(\mathbf{z}_i)$, we can generate
a variety of different models, as summarized in Table 11.1.


Mixture models 


The simplest form of LVM is when $z_i \in \{1, \ldots, K\}$, representing a discrete latent state. We will
use a discrete prior for this, $p(z_i) = \mathrm{Cat}(\boldsymbol{\pi})$. For the likelihood, we use $p(\mathbf{x}_i \mid z_i = k) = p_k(\mathbf{x}_i)$,
where $p_k$ is the k'th base distribution for the observations; this can be of any type. The overall
model is known as a mixture model, since we are mixing together the K base distributions as
follows:

$$p(\mathbf{x}_i \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k\, p_k(\mathbf{x}_i) \qquad (11.1)$$

This is a convex combination of the $p_k$'s, since we are taking a weighted sum, where the
mixing weights $\pi_k$ satisfy $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$. We give some examples below.

Figure 11.1 A DGM with and without hidden variables. The leaves represent medical symptoms. The
roots represent primary causes, such as smoking, diet and exercise. The hidden variable can represent
mediating factors, such as heart disease, which might not be directly visible.

Figure 11.2 A latent variable model represented as a DGM. (a) Many-to-many. (b) One-to-many. (c)
Many-to-one. (d) One-to-one.


p(x_i|z_i)        p(z_i)            Name                                  Section
MVN               Discrete          Mixture of Gaussians                  11.2.1
Prod. Discrete    Discrete          Mixture of multinomials               11.2.2
Prod. Gaussian    Prod. Gaussian    Factor analysis/ probabilistic PCA    12.1.5
Prod. Gaussian    Prod. Laplace     Probabilistic ICA/ sparse coding      12.6
Prod. Discrete    Prod. Gaussian    Multinomial PCA                       27.2.3
Prod. Discrete    Dirichlet         Latent Dirichlet allocation           27.3
Prod. Noisy-OR    Prod. Bernoulli   BN20/ QMR                             10.2.3
Prod. Bernoulli   Prod. Bernoulli   Sigmoid belief net                    27.7

Table 11.1 Summary of some popular directed latent variable models. Here “Prod” means product, so
“Prod. Discrete” in the likelihood means a factored distribution of the form $\prod_j \mathrm{Cat}(x_{ij} \mid \mathbf{z}_i)$, and “Prod.
Gaussian” means a factored distribution of the form $\prod_j \mathcal{N}(x_{ij} \mid \mathbf{z}_i)$. “PCA” stands for “principal components
analysis”. “ICA” stands for “independent components analysis”.


Figure 11.3 A mixture of 3 Gaussians in 2d. (a) We show the contours of constant probability for each
component in the mixture. (b) A surface plot of the overall density. Based on Figure 2.23 of (Bishop 2006a).
Figure generated by mixGaussPlotDemo.


Mixtures of Gaussians 


The most widely used mixture model is the mixture of Gaussians (MOG), also called a Gaussian
mixture model or GMM. In this model, each base distribution in the mixture is a multivariate
Gaussian with mean $\boldsymbol{\mu}_k$ and covariance matrix $\boldsymbol{\Sigma}_k$. Thus the model has the form

$$p(\mathbf{x}_i \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \qquad (11.2)$$

Figure 11.3 shows a mixture of 3 Gaussians in 2D. Each mixture component is represented by a
different set of elliptical contours. Given a sufficiently large number of mixture components, a
GMM can be used to approximate any density defined on $\mathbb{R}^D$.
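
Equation 11.2 can be evaluated directly as a weighted sum of multivariate Gaussian densities. The following is a minimal NumPy sketch; the function names `mvn_pdf` and `gmm_pdf` and the example parameters are illustrative assumptions, not the code behind Figure 11.3.

```python
import numpy as np


def mvn_pdf(x, mu, Sigma):
    """Density of a multivariate Gaussian N(x | mu, Sigma) at a single point x."""
    D = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)           # (x - mu)^T Sigma^{-1} (x - mu)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm


def gmm_pdf(x, pis, mus, Sigmas):
    """Mixture density: sum_k pi_k N(x | mu_k, Sigma_k), as in Equation 11.2."""
    return sum(pi * mvn_pdf(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas))


# An arbitrary 3-component mixture in 2d (values chosen only for illustration).
pis = [0.5, 0.3, 0.2]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.6], [0.6, 1.0]]), 0.5 * np.eye(2)]

print(gmm_pdf(np.array([0.5, 0.5]), pis, mus, Sigmas))
```

For numerical work in high dimensions one would normally evaluate the component densities in the log domain; the direct form is used here only because it mirrors the equation.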


Mixture of multinoullis


We can use mixture models to define density models on many kinds of data. For example,
suppose our data consist of D-dimensional bit vectors. In this case, an appropriate class-conditional
density is a product of Bernoullis:

$$p(\mathbf{x}_i \mid z_i = k, \boldsymbol{\theta}) = \prod_{j=1}^{D} \mathrm{Ber}(x_{ij} \mid \mu_{jk}) = \prod_{j=1}^{D} \mu_{jk}^{x_{ij}} (1 - \mu_{jk})^{1 - x_{ij}} \qquad (11.3)$$

where $\mu_{jk}$ is the probability that bit j turns on in cluster k.

The latent variables do not have to have any meaning; we might simply introduce latent variables
in order to make the model more powerful. For example, one can show (Exercise 11.8) that the
mean and covariance of the mixture distribution are given by

$$\mathbb{E}[\mathbf{x}] = \sum_k \pi_k \boldsymbol{\mu}_k \qquad (11.4)$$

$$\mathrm{cov}[\mathbf{x}] = \sum_k \pi_k \left[ \boldsymbol{\Sigma}_k + \boldsymbol{\mu}_k \boldsymbol{\mu}_k^T \right] - \mathbb{E}[\mathbf{x}]\, \mathbb{E}[\mathbf{x}]^T \qquad (11.5)$$

where $\boldsymbol{\Sigma}_k = \mathrm{diag}(\mu_{jk}(1 - \mu_{jk}))$. So although the component distributions are factorized,
the joint distribution is not. Thus the mixture distribution can capture correlations between
variables, unlike a single product-of-Bernoullis model.
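
A small numerical check makes the last point concrete. The sketch below applies Equations 11.4 and 11.5 to a two-component mixture of Bernoullis over two bits; the mixing weights and means are made-up values chosen so that one cluster tends to turn both bits on and the other tends to turn both off.

```python
import numpy as np

pis = np.array([0.5, 0.5])           # mixing weights pi_k
mus = np.array([[0.9, 0.9],          # mus[k, j] = mu_jk: cluster 0, bits mostly on
                [0.1, 0.1]])         # cluster 1, bits mostly off

mean = pis @ mus                     # Equation 11.4

# Equation 11.5, with Sigma_k = diag(mu_jk (1 - mu_jk)) for each component.
second_moment = sum(
    pi * (np.diag(mu * (1 - mu)) + np.outer(mu, mu))
    for pi, mu in zip(pis, mus)
)
cov = second_moment - np.outer(mean, mean)

print(mean)   # each bit is on with marginal probability 0.5
print(cov)    # the off-diagonal entries are nonzero: the bits are correlated
```

Each component has a diagonal covariance, yet the mixture has off-diagonal covariance of about 0.16 here, because knowing one bit is informative about which cluster generated the vector.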


Using mixture models for clustering 


There are two main applications of mixture models. The first is to use them as a black-box 
density model, p(x;). This can be useful for a variety of tasks, such as data compression, outlier 
detection, and creating generative classifiers, where we model each class-conditional density 
p(x|y = c) by a mixture distribution (see Section 14.7.3). 

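The second, and more common, application of mixture models is to use them for clustering.
We discuss this topic in detail in Chapter 25, but the basic idea is simple. We first fit the mixture
model, and then compute $p(z_i = k \mid \mathbf{x}_i, \boldsymbol{\theta})$, which represents the posterior probability that point
i belongs to cluster k. This is known as the responsibility of cluster k for point i, and can be
computed using Bayes rule as follows:

$$r_{ik} \triangleq p(z_i = k \mid \mathbf{x}_i, \boldsymbol{\theta}) = \frac{p(z_i = k \mid \boldsymbol{\theta})\, p(\mathbf{x}_i \mid z_i = k, \boldsymbol{\theta})}{\sum_{k'=1}^{K} p(z_i = k' \mid \boldsymbol{\theta})\, p(\mathbf{x}_i \mid z_i = k', \boldsymbol{\theta})} \qquad (11.6)$$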
This procedure is called soft clustering, and is identical to the computations performed when 
using a generative classifier. The difference between the two models only arises at training time: 
in the mixture case, we never observe z;, whereas with a generative classifier, we do observe y; 
(which plays the role of z;). 

We can represent the amount of uncertainty in the cluster assignment by using $1 - \max_k r_{ik}$.
Assuming this is small, it may be reasonable to compute a hard clustering using the MAP
estimate, given by

$$z_i^* = \arg\max_k r_{ik} = \arg\max_k \left[ \log p(\mathbf{x}_i \mid z_i = k, \boldsymbol{\theta}) + \log p(z_i = k \mid \boldsymbol{\theta}) \right] \qquad (11.7)$$

Figure 11.4 (a) Some yeast gene expression data plotted as a time series. (b) Visualizing the 16 cluster
centers produced by K-means. Figure generated by kmeansYeastDemo.

Figure 11.5 We fit a mixture of 10 Bernoullis to the binarized MNIST digit data. We show the MLE for the
corresponding cluster means, $\hat{\boldsymbol{\mu}}_k$. The numbers on top of each image represent the mixing weights $\hat{\pi}_k$.
No labels were used when training the model. Figure generated by mixBerMnistEM.
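
The responsibilities of Equation 11.6 and the hard assignment of Equation 11.7 can be sketched as follows for a 1d GMM with known parameters. This is a minimal NumPy illustration; the function names and the example data points are assumptions. The computation is done in the log domain and the row maximum is subtracted before exponentiating, which is a standard way to avoid underflow rather than anything specific to this book.

```python
import numpy as np


def log_gauss_1d(x, mu, sigma):
    """Log density of N(x | mu, sigma^2), broadcasting over its arguments."""
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2


def responsibilities(x, pis, mus, sigmas):
    """Return r[i, k] = p(z_i = k | x_i, theta), as in Equation 11.6."""
    # log_joint[i, k] = log p(z_i = k) + log p(x_i | z_i = k)
    log_joint = np.log(pis) + log_gauss_1d(x[:, None], mus, sigmas)
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)  # stabilize
    r = np.exp(log_joint)
    return r / r.sum(axis=1, keepdims=True)


x = np.array([-9.0, 0.0, 11.0])                 # three example observations
r = responsibilities(x,
                     pis=np.array([0.5, 0.5]),
                     mus=np.array([-10.0, 10.0]),
                     sigmas=np.array([5.0, 5.0]))

z_hard = r.argmax(axis=1)                       # MAP assignment, Equation 11.7
uncertainty = 1.0 - r.max(axis=1)               # 1 - max_k r_ik

print(np.round(r, 3), z_hard, np.round(uncertainty, 3))
```

The point at x = 0 lies exactly between the two means, so its responsibilities are equal and its uncertainty takes the largest possible value for K = 2, namely 0.5; a hard assignment for such a point is essentially arbitrary.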


Hard clustering using a GMM is illustrated in Figure 1.8, where we cluster some data representing
the height and weight of people. The colors represent the hard assignments. Note
that the identity of the labels (colors) used is immaterial; we are free to rename all the clusters,
without affecting the partitioning of the data; this is called label switching.

Another example is shown in Figure 11.4. Here the data vectors $\mathbf{x}_i \in \mathbb{R}^7$ represent the
expression levels of different genes at 7 different time points. We clustered them using a GMM.
We see that there are several kinds of genes, such as those whose expression level goes up 
monotonically over time (in response to a given stimulus), those whose expression level goes 
down monotonically, and those with more complex response patterns. We have clustered the 
series into K = 16 groups. (See Section 11.5 for details on how to choose K.) For example, we 
can represent each cluster by a prototype or centroid. This is shown in Figure 11.4(b). 

As an example of clustering binary data, consider a binarized version of the MNIST handwritten
digit dataset (see Figure 1.5(a)), where we ignore the class labels. We can fit a mixture of
Bernoullis to this, using K = 10, and then visualize the resulting centroids, $\hat{\boldsymbol{\mu}}_k$, as shown in
Figure 11.5. We see that the method correctly discovered some of the digit classes, but overall the
results aren't great: it has created multiple clusters for some digits, and no clusters for others.
There are several possible reasons for these “errors”:


• The model is very simple and does not capture the relevant visual characteristics of a digit.
For example, each pixel is treated independently, and there is no notion of shape or a stroke.


• Although we think there should be 10 clusters, some of the digits actually exhibit a fair degree
of visual variety. For example, there are two ways of writing 7’s (with and without the cross
bar). Figure 1.5(a) illustrates some of the range in writing styles. Thus we need K > 10
clusters to adequately model this data. However, if we set K to be large, there is nothing
in the model or algorithm preventing the extra clusters from being used to create multiple
versions of the same digit, and indeed this is what happens. We can use model selection
to prevent too many clusters from being chosen, but what looks visually appealing and what
makes a good density estimator may be quite different.


• The likelihood function is not convex, so we may be stuck in a local optimum, as we explain
in Section 11.3.2.


This example is typical of mixture modeling, and goes to show one must be very cautious 
trying to “interpret” any clusters that are discovered by the method. (Adding a little bit of 
supervision, or using informative priors, can help a lot.) 


Mixtures of experts 


Section 14.7.3 described how to use mixture models in the context of generative classifiers. We
can also use them to create discriminative models for classification and regression. For example,
consider the data in Figure 11.6(a). It seems like a good model would be three different linear
regression functions, each applying to a different part of the input space. We can model this by
allowing the mixing weights and the mixture densities to be input-dependent:

$$p(y_i \mid \mathbf{x}_i, z_i = k, \boldsymbol{\theta}) = \mathcal{N}(y_i \mid \mathbf{w}_k^T \mathbf{x}_i, \sigma_k^2) \qquad (11.8)$$
$$p(z_i \mid \mathbf{x}_i, \boldsymbol{\theta}) = \mathrm{Cat}(z_i \mid \mathcal{S}(\mathbf{V}^T \mathbf{x}_i)) \qquad (11.9)$$


See Figure 11.7(a) for the DGM. 

This model is called a mixture of experts or MoE (Jordan and Jacobs 1994). The idea is that
each submodel is considered to be an “expert” in a certain region of input space. The function
$p(z_i = k \mid \mathbf{x}_i, \boldsymbol{\theta})$ is called a gating function, and decides which expert to use, depending on
the input values. For example, Figure 11.6(b) shows how the three experts have “carved up” the
1d input space, Figure 11.6(a) shows the predictions of each expert individually (in this case, the
experts are just linear regression models), and Figure 11.6(c) shows the overall prediction of the
model, obtained using

$$p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}) = \sum_k p(z_i = k \mid \mathbf{x}_i, \boldsymbol{\theta})\, p(y_i \mid \mathbf{x}_i, z_i = k, \boldsymbol{\theta}) \qquad (11.10)$$

We discuss how to fit this model in Section 11.4.3.

Figure 11.6 (a) Some data fit with three separate regression lines. (b) Gating functions for three different
“experts”. (c) The conditionally weighted average of the three expert predictions. Figure generated by
mixexpDemo.

Figure 11.7 (a) A mixture of experts. (b) A hierarchical mixture of experts.

Figure 11.8 (a) Some data from a simple forwards model. (b) Some data from the inverse model, fit
with a mixture of 3 linear regressions. Training points are color coded by their responsibilities. (c) The
predictive mean (red cross) and mode (black square). Based on Figures 5.20 and 5.21 of (Bishop 2006b).
Figure generated by mixexpDemoOneToMany.
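
Equations 11.8 to 11.10 translate into a few lines of code once the parameters are given. The following is a minimal NumPy sketch of the predictive density of a mixture of linear experts with a softmax gating function. It is not the book's mixexpDemo; the function names, the parameter shapes, and the numerical values of `V`, `W` and `sigma2` are all illustrative assumptions, and no fitting is performed.

```python
import numpy as np


def softmax(a):
    """Numerically stable softmax S(a) of a 1d array."""
    e = np.exp(a - a.max())
    return e / e.sum()


def moe_predict(x, y, V, W, sigma2):
    """Mixture of linear experts for a single input x and candidate output y.

    x: (D,) input (including a constant 1 if a bias is wanted)
    V: (D, K) gating weights, W: (D, K) expert weights, sigma2: (K,) variances
    Returns the gate probabilities, the expert means and p(y | x, theta).
    """
    gate = softmax(V.T @ x)                  # p(z = k | x), Equation 11.9
    means = W.T @ x                          # w_k^T x for each expert
    dens = np.exp(-0.5 * (y - means) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return gate, means, float(gate @ dens)   # Equation 11.10


# Three experts on a 1d input with a bias term, x = [1, x1] (made-up parameters).
V = np.array([[4.0, 0.0, -4.0],
              [-8.0, 0.0, 8.0]])
W = np.array([[0.0, 0.5, 1.0],
              [1.0, 0.0, -1.0]])
sigma2 = np.array([0.01, 0.01, 0.01])

gate, means, p_y = moe_predict(np.array([1.0, 0.1]), 0.1, V, W, sigma2)
print(np.round(gate, 3), means, p_y)
```

The weighted average `gate @ means` gives the predictive mean, while picking the mean of the expert with the largest gate probability gives a simple approximation to the predictive mode.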


It should be clear that we can “plug in” any model for the expert. For example, we can use 
neural networks (Chapter 16) to represent both the gating functions and the experts. The result 
is known as a mixture density network. Such models are slower to train, but can be more 
flexible than mixtures of experts. See (Bishop 1994) for details. 

It is also possible to make each expert be itself a mixture of experts. This gives rise to a 
model known as the hierarchical mixture of experts. See Figure 11.7(b) for the DGM, and 
Section 16.2.6 for further details. 


Application to inverse problems 


Mixtures of experts are useful in solving inverse problems. These are problems where we have
to invert a many-to-one mapping. A typical example is in robotics, where the location of the
end effector (hand) y is uniquely determined by the joint angles of the motors, x. However,
for any given location y, there are many settings of the joints x that can produce it. Thus the
inverse mapping $\mathbf{x} = f^{-1}(\mathbf{y})$ is not unique. Another example is kinematic tracking of people
from video (Bo et al. 2008), where the mapping from image appearance to pose is not unique,
due to self occlusion, etc.

Figure 11.9 A LVM represented as a DGM. Left: Model is unrolled for N examples. Right: same model
using plate notation.


A simpler example, for illustration purposes, is shown in Figure 11.8(a). We see that this
defines a function, y = f(x), since for every value x along the horizontal axis, there is a
unique response y. This is sometimes called the forwards model. Now consider the problem
of computing $x = f^{-1}(y)$. The corresponding inverse model is shown in Figure 11.8(b); this is
obtained by simply interchanging the x and y axes. Now we see that for some values along
the horizontal axis, there are multiple possible outputs, so the inverse is not uniquely defined.
For example, if y = 0.8, then x could be 0.2 or 0.8. Consequently, the predictive distribution,
$p(x \mid y, \boldsymbol{\theta})$, is multimodal.

We can fit a mixture of linear experts to this data. Figure 11.8(b) shows the prediction of each
expert, and Figure 11.8(c) shows (a plugin approximation to) the posterior predictive mode and
mean. Note that the posterior mean does not yield good predictions. In fact, any model which
is trained to minimize mean squared error, even a flexible nonlinear model such as a neural
network, will work poorly on inverse problems such as this. However, the
posterior mode, where the mode is input dependent, provides a reasonable approximation.
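
The failure of the posterior mean is easy to see numerically. The sketch below builds a toy bimodal predictive distribution for x given y = 0.8, with equally weighted modes at 0.2 and 0.8 as in the example above; the component standard deviation of 0.05 is an assumption chosen purely for illustration. The mean falls between the two modes, in a region where the predictive density is essentially zero.

```python
import numpy as np

# Toy predictive distribution p(x | y = 0.8): two equally likely answers.
pis = np.array([0.5, 0.5])
mus = np.array([0.2, 0.8])
sigma = 0.05                      # assumed spread of each mode


def density(x):
    """Mixture density at a scalar x."""
    comps = np.exp(-0.5 * ((x - mus) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(pis @ comps)


mean = float(pis @ mus)                               # predictive mean
grid = np.linspace(0.0, 1.0, 1001)
mode = float(grid[np.argmax([density(g) for g in grid])])   # one of the two modes

print(mean, density(mean))     # the mean sits in a region of negligible density
print(mode, density(mode))     # the mode is a plausible answer
```

A predictor trained to minimize squared error converges towards the conditional mean, which is why it lands in the gap between the two valid solutions.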


Parameter estimation for mixture models 


We have seen how to compute the posterior over the hidden variables given the observed 
variables, assuming the parameters are known. In this section, we discuss how to learn the 
parameters. 

In Section 10.4.2, we showed that when we have complete data and a factored prior, the 
posterior over the parameters also factorizes, making computation very simple. Unfortunately 
this is no longer true if we have hidden variables and/or missing data. The reason is apparent 
from looking at Figure 11.9. If the $z_i$ were observed, then by d-separation, we see that 
$\theta_z \perp \theta_x | D$, and hence the posterior will factorize. But since, in an LVM, the $z_i$ are hidden, 
the parameters are no longer independent, and the posterior does not factorize, making it much 
harder to compute. This also complicates the computation of MAP and ML estimates, as we discuss below. 

Figure 11.10 Left: N = 200 data points sampled from a mixture of 2 Gaussians in 1d, with $\pi_k = 0.5$, 
$\sigma_k = 5$, $\mu_1 = -10$ and $\mu_2 = 10$. Right: Likelihood surface $p(D|\mu_1, \mu_2)$, with all other parameters set 
to their true values. We see the two symmetric modes, reflecting the unidentifiability of the parameters. 
Figure generated by mixGaussLikSurfaceDemo. 


Unidentifiability 


The main problem with computing $p(\theta|D)$ for an LVM is that the posterior may have multiple 
modes. To see why, consider a GMM. If the $z_i$ were all observed, we would have a unimodal 
posterior for the parameters: 

$$p(\theta|D) = \mathrm{Dir}(\pi|D) \prod_{k=1}^{K} \mathrm{NIW}(\mu_k, \Sigma_k|D) \tag{11.11}$$


Consequently we can easily find the globally optimal MAP estimate (and hence globally optimal 
MLE). 

But now suppose the $z_i$'s are hidden. In this case, for each of the possible ways of “filling in” 
the $z_i$'s, we get a different unimodal likelihood. Thus when we marginalize out over the $z_i$'s, we 
get a multi-modal posterior for $p(\theta|D)$.¹ These modes correspond to different labelings of the 
clusters. This is illustrated in Figure 11.10(b), where we plot the likelihood function, $p(D|\mu_1, \mu_2)$, 
for a 1d GMM with $K = 2$, for the data shown in Figure 11.10(a). We see two peaks, one 
corresponding to the case where $\mu_1 = -10$, $\mu_2 = 10$, and the other to the case where $\mu_1 = 10$, 
$\mu_2 = -10$. We say the parameters are not identifiable, since there is not a unique MLE. 
Therefore there cannot be a unique MAP estimate (assuming the prior does not rule out certain 
labelings), and hence the posterior must be multimodal. The question of how many modes there 
are in the parameter posterior is hard to answer. There are $K!$ possible labelings, but some of 
the peaks might get merged. Nevertheless, there can be an exponential number, since finding 
the optimal MLE for a GMM is NP-hard (Aloise et al. 2009; Drineas et al. 2004). 

1. Do not confuse multimodality of the parameter posterior, $p(\theta|D)$, with the multimodality defined by the model, 
$p(x|\theta)$. In the latter case, if we have K clusters, we would expect to only get K peaks, although it is theoretically 
possible to get more than K, at least if $D > 1$ (Carreira-Perpinan and Williams 2003). 

Unidentifiability can cause a problem for Bayesian inference. For example, suppose we 
draw some samples from the posterior, $\theta^{(s)} \sim p(\theta|D)$, and then average them, to try to 
approximate the posterior mean, $\bar{\theta} = \frac{1}{S} \sum_{s=1}^{S} \theta^{(s)}$. (This kind of Monte Carlo approach is 
explained in more detail in Chapter 24.) If the samples come from different modes, the average 
will be meaningless. Note, however, that it is reasonable to average the posterior predictive 
distributions, $p(x) \approx \frac{1}{S} \sum_{s=1}^{S} p(x|\theta^{(s)})$, since the likelihood function is invariant to which 
mode the parameters came from. 
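
The contrast between averaging parameters and averaging predictive densities can be checked numerically. The following is a minimal sketch, not taken from the book's software: it assumes an equal-weight 1d mixture with a shared standard deviation of 5, and the helper names `gauss_pdf` and `mix_pdf` are hypothetical. The two "samples" are the two relabelings of the same solution.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of a 1d Gaussian N(x | mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mix_pdf(x, mus, sigma=5.0):
    """Equal-weight mixture of 1d Gaussians with a shared sigma (an assumption)."""
    return sum(gauss_pdf(x, mu, sigma) for mu in mus) / len(mus)

# Two "posterior samples" that differ only by a relabeling of the clusters.
theta_a = (-10.0, 10.0)
theta_b = (10.0, -10.0)

# Averaging the parameters collapses both means onto 0.
theta_bar = tuple((a + b) / 2 for a, b in zip(theta_a, theta_b))

x = 10.0
p_a = mix_pdf(x, theta_a)           # predictive density under labeling a
p_b = mix_pdf(x, theta_b)           # identical, since the likelihood ignores labels
p_avg_pred = 0.5 * (p_a + p_b)      # average of the predictive densities
p_of_avg = mix_pdf(x, theta_bar)    # density under the averaged parameters
print(theta_bar, p_avg_pred, p_of_avg)
```

The averaged parameters describe a single bump at the origin, which assigns much lower density to the point x = 10 than either labeling does, whereas the averaged predictive density is unchanged by the relabeling.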

A variety of solutions have been proposed to the unidentifiability problem. These solutions 
depend on the details of the model and the inference algorithm that is used. For example, see 
(Stephens 2000) for an approach to handling unidentifiability in mixture models using MCMC. 

The approach we will adopt in this chapter is much simpler: we just compute a single 
local mode, i.e., we perform approximate MAP estimation. (We say “approximate” since finding 
the globally optimal MLE, and hence MAP estimate, is NP-hard, at least for mixture models 
(Aloise et al. 2009).) This is by far the most common approach, because of its simplicity. It 
is also a reasonable approximation, at least if the sample size is sufficiently large. To see why, 
consider Figure 11.9(a). We see that there are N latent variables, each of which gets to “see” 
one data point each. However, there are only two latent parameters, each of which gets to 
see N data points. So the posterior uncertainty about the parameters is typically much less 
than the posterior uncertainty about the latent variables. This justifies the common strategy 
of computing $p(z_i|x_i, \theta)$, but not bothering to compute $p(\theta|D)$. In Section 5.6, we will study 
hierarchical Bayesian models, which essentially put structure on top of the parameters. In such 
models, it is important to model $p(\theta|D)$, so that the parameters can send information between 
themselves. If we used a point estimate, this would not be possible. 


Computing a MAP estimate is non-convex 


In the previous sections, we have argued, rather heuristically, that the likelihood function has 
multiple modes, and hence that finding a MAP or ML estimate will be hard. In this section, we 
show this result by more algebraic means, which sheds additional light on the problem. 
Our presentation is based in part on (Rennie 2004). 

Consider the log-likelihood for an LVM: 

$$\log p(D|\theta) = \sum_i \log \left[ \sum_{z_i} p(x_i, z_i|\theta) \right] \tag{11.12}$$

Unfortunately, this objective is hard to maximize, since we cannot push the log inside the sum. 
This precludes certain algebraic simplifications, but does not prove the problem is hard. 

Now suppose the joint probability distribution $p(z_i, x_i|\theta)$ is in the exponential family, which 
means it can be written as follows: 

$$p(x, z|\theta) = \frac{1}{Z(\theta)} \exp[\theta^T \phi(x, z)] \tag{11.13}$$

where $\phi(x, z)$ are the sufficient statistics, and $Z(\theta)$ is the normalization constant (see Section 9.2 
for more details). It can be shown (Exercise 9.2) that the MVN is in the exponential 
family, as are nearly all of the distributions we have encountered so far, including Dirichlet, 
multinomial, Gamma, Wishart, etc. (The Student distribution is a notable exception.) Furthermore, 
mixtures of exponential families are also in the exponential family, providing the mixing 
indicator variables are observed (Exercise 11.11). 

With this assumption, the complete data log likelihood can be written as follows: 

$$\ell_c(\theta) = \sum_i \log p(x_i, z_i|\theta) = \theta^T \left( \sum_i \phi(x_i, z_i) \right) - N \log Z(\theta) \tag{11.14}$$

The first term is clearly linear in $\theta$. One can show that $\log Z(\theta)$ is a convex function (Boyd and 
Vandenberghe 2004), so the overall objective is concave (due to the minus sign), and hence has 
a unique maximum. 

Now consider what happens when we have missing data. The observed data log likelihood 
is given by 

$$\ell(\theta) = \sum_i \log \sum_{z_i} p(x_i, z_i|\theta) = \sum_i \log \left[ \sum_{z_i} e^{\theta^T \phi(x_i, z_i)} \right] - N \log Z(\theta) \tag{11.15}$$

One can show that the log-sum-exp function is convex (Boyd and Vandenberghe 2004), and we 
know that $\log Z(\theta)$ is convex. However, the difference of two convex functions is not, in general, 
convex. So the objective is neither convex nor concave, and has local optima. 
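
A quick numerical check makes the lack of concavity concrete. The sketch below is illustrative only: the six data points, the fixed standard deviation of 2 and the function name `log_lik` are assumptions chosen for the example. A concave function would satisfy f(midpoint) >= average of the endpoint values; here the midpoint between the two symmetric modes is far worse than either mode.

```python
import math

def log_lik(mu1, mu2, data, sigma=2.0):
    """Observed-data log likelihood of an equal-weight, two-component 1d GMM."""
    total = 0.0
    norm = sigma * math.sqrt(2 * math.pi)
    for x in data:
        p1 = math.exp(-0.5 * ((x - mu1) / sigma) ** 2) / norm
        p2 = math.exp(-0.5 * ((x - mu2) / sigma) ** 2) / norm
        total += math.log(0.5 * p1 + 0.5 * p2)  # log of a sum: cannot be split
    return total

# Toy data: two well separated groups (an assumption for illustration).
data = [-11.0, -10.0, -9.0, 9.0, 10.0, 11.0]

ll_a = log_lik(-10.0, 10.0, data)    # one labeling of the clusters
ll_b = log_lik(10.0, -10.0, data)    # the swapped labeling
ll_mid = log_lik(0.0, 0.0, data)     # midpoint of the two parameter vectors
print(ll_a, ll_b, ll_mid)
```

Since the two labelings give the same value and the midpoint gives a strictly lower one, the log likelihood cannot be concave in the means.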

The disadvantage of non-convex functions is that it is usually hard to find their global optimum. 
Most optimization algorithms will only find a local optimum; which one they find 
depends on where they start. There are some algorithms, such as simulated annealing 
(Section 24.6.1) or genetic algorithms, that claim to always find the global optimum, but this is only 
under unrealistic assumptions (e.g., if they are allowed to be cooled “infinitely slowly”, or 
allowed to run “infinitely long”). In practice, we will run a local optimizer, perhaps using multiple 
random restarts to increase our chance of finding a “good” local optimum. Of course, careful 
initialization can help a lot, too. We give examples of how to do this on a case-by-case basis. 

Note that a convex method for fitting mixtures of Gaussians has been proposed. The idea 
is to assign one cluster per data point, and select from amongst them, using a convex $\ell_1$-type 
penalty, rather than trying to optimize the locations of the cluster centers. See (Lashkari and 
Golland 2007) for details. This is essentially an unsupervised version of the approach used in 
sparse kernel logistic regression, which we will discuss in Section 14.3.2. Note, however, that the 
$\ell_1$ penalty, although convex, is not necessarily a good way to promote sparsity, as discussed in 
Chapter 13. In fact, as we will see in that chapter, some of the best sparsity-promoting methods 
use non-convex penalties, and use EM to optimize them! The moral of the story is: do not be 
afraid of non-convexity. 


The EM algorithm 


For many models in machine learning and statistics, computing the ML or MAP parameter 
estimate is easy provided we observe all the values of all the relevant random variables, i.e., if 
we have complete data. However, if we have missing data and/or latent variables, then computing 
the ML/MAP estimate becomes hard. 

Model                                    Section 
Mix. Gaussians                           11.4.2 
Mix. experts                             11.4.3 
Factor analysis                          12.1.5 
Student T                                11.4.5 
Probit regression                        11.4.6 
DGM with hidden variables                11.4.4 
MVN with missing data                    11.6.1 
HMMs                                     17.5.2 
Shrinkage estimates of Gaussian means    Exercise 11.13 

Table 11.2 Some models discussed in this book for which EM can be easily applied to find the ML/MAP 
parameter estimate. 

One approach is to use a generic gradient-based optimizer to find a local minimum of the 
negative log likelihood or NLL, given by 

$$\mathrm{NLL}(\theta) \triangleq -\frac{1}{N} \log p(D|\theta) \tag{11.16}$$

However, we often have to enforce constraints, such as the fact that covariance matrices must be 
positive definite, mixing weights must sum to one, etc., which can be tricky (see Exercise 11.5). In 
such cases, it is often much simpler (but not always faster) to use an algorithm called expectation 
maximization, or EM for short (Dempster et al. 1977; Meng and van Dyk 1997; McLachlan and 
Krishnan 1997). This is a simple iterative algorithm, often with closed-form updates at each step. 
Furthermore, the algorithm automatically enforces the required constraints. 

EM exploits the fact that if the data were fully observed, then the ML/MAP estimate would be 
easy to compute. In particular, EM is an iterative algorithm which alternates between inferring 
the missing values given the parameters (E step), and then optimizing the parameters given the 
“filled in” data (M step). We give the details below, followed by several examples. We end with 
a more theoretical discussion, where we put the algorithm in a larger context. See Table 11.2 for 
a summary of the applications of EM in this book. 


Basic idea 


Let $x_i$ be the visible or observed variables in case $i$, and let $z_i$ be the hidden or missing 
variables. The goal is to maximize the log likelihood of the observed data: 

$$\ell(\theta) = \sum_{i=1}^{N} \log p(x_i|\theta) = \sum_{i=1}^{N} \log \left[ \sum_{z_i} p(x_i, z_i|\theta) \right] \tag{11.17}$$

Unfortunately this is hard to optimize, since the log cannot be pushed inside the sum. 

EM gets around this problem as follows. Define the complete data log likelihood to be 

$$\ell_c(\theta) \triangleq \sum_{i=1}^{N} \log p(x_i, z_i|\theta) \tag{11.18}$$

This cannot be computed, since $z_i$ is unknown. So let us define the expected complete data 
log likelihood as follows: 

$$Q(\theta, \theta^{t-1}) = E\left[ \ell_c(\theta) \,|\, D, \theta^{t-1} \right] \tag{11.19}$$

where $t$ is the current iteration number. Q is called the auxiliary function. The expectation 
is taken wrt the old parameters, $\theta^{t-1}$, and the observed data D. The goal of the E step is to 
compute $Q(\theta, \theta^{t-1})$, or rather, the terms inside of it which the MLE depends on; these are 
known as the expected sufficient statistics or ESS. In the M step, we optimize the Q function 
wrt $\theta$: 

$$\theta^t = \arg\max_{\theta} Q(\theta, \theta^{t-1}) \tag{11.20}$$

To perform MAP estimation, we modify the M step as follows: 

$$\theta^t = \arg\max_{\theta} Q(\theta, \theta^{t-1}) + \log p(\theta) \tag{11.21}$$

The E step remains unchanged. 

In Section 11.4.7 we show that the EM algorithm monotonically increases the log likelihood of 
the observed data (plus the log prior, if doing MAP estimation), or it stays the same. So if the 
objective ever goes down, there must be a bug in our math or our code. (This is a surprisingly 
useful debugging tool!) 

Below we explain how to perform the E and M steps for several simple models, which should 
make things clearer. 


EM for GMMs 


In this section, we discuss how to fit a mixture of Gaussians using EM. Fitting other kinds of 
mixture models requires a straightforward modification — see Exercise 11.3. We assume the 
number of mixture components, K, is known (see Section 11.5 for discussion of this point). 

11.4.2.1 Auxiliary function 

The expected complete data log likelihood is given by 

$$
\begin{aligned}
Q(\theta, \theta^{(t-1)}) &\triangleq E\left[ \sum_i \log p(x_i, z_i|\theta) \right] && (11.22) \\
&= \sum_i E\left[ \log \left[ \prod_{k=1}^{K} (\pi_k p(x_i|\theta_k))^{I(z_i = k)} \right] \right] && (11.23) \\
&= \sum_i \sum_k E[I(z_i = k)] \log[\pi_k p(x_i|\theta_k)] && (11.24) \\
&= \sum_i \sum_k p(z_i = k|x_i, \theta^{(t-1)}) \log[\pi_k p(x_i|\theta_k)] && (11.25) \\
&= \sum_i \sum_k r_{ik} \log \pi_k + \sum_i \sum_k r_{ik} \log p(x_i|\theta_k) && (11.26)
\end{aligned}
$$

where $r_{ik} \triangleq p(z_i = k|x_i, \theta^{(t-1)})$ is the responsibility that cluster $k$ takes for data point $i$. 
This is computed in the E step, described below. 


11.4.2.2 E step 


The E step has the following simple form, which is the same for any mixture model: 

$$r_{ik} = \frac{\pi_k \, p(x_i|\theta_k^{(t-1)})}{\sum_{k'} \pi_{k'} \, p(x_i|\theta_{k'}^{(t-1)})} \tag{11.27}$$
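
The responsibilities are simple to compute. The following is a minimal sketch for the 1d Gaussian case, not the book's own implementation; the function name `e_step` and the toy inputs are assumptions. It works in log space and subtracts the maximum before exponentiating, which is a common way to avoid underflow when a point lies far from every component.

```python
import math

def e_step(xs, pis, mus, sigmas):
    """Return r[i][k] = p(z_i = k | x_i, theta) for a 1d Gaussian mixture."""
    resp = []
    for x in xs:
        # log(pi_k) + log N(x | mu_k, sigma_k^2) for each component k
        log_w = [
            math.log(pi) - math.log(s) - 0.5 * math.log(2 * math.pi)
            - 0.5 * ((x - mu) / s) ** 2
            for pi, mu, s in zip(pis, mus, sigmas)
        ]
        m = max(log_w)                       # log-sum-exp trick: shift by the max
        w = [math.exp(lw - m) for lw in log_w]
        total = sum(w)                       # normalizer (denominator of the ratio)
        resp.append([wk / total for wk in w])
    return resp

r = e_step([-2.0, 0.0, 2.0], pis=[0.5, 0.5], mus=[-2.0, 2.0], sigmas=[1.0, 1.0])
print(r)
```

In this toy example the point at 0 is equidistant from both centers and so is shared equally, while the points at -2 and 2 are assigned almost entirely to their nearest component.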


11.4.2.3 M step 


In the M step, we optimize Q wrt $\pi$ and the $\theta_k$. For $\pi$, we obviously have 

$$\pi_k = \frac{1}{N} \sum_i r_{ik} = \frac{r_k}{N} \tag{11.28}$$

where $r_k \triangleq \sum_i r_{ik}$ is the weighted number of points assigned to cluster $k$. 

To derive the M step for the $\mu_k$ and $\Sigma_k$ terms, we look at the parts of Q that depend on $\mu_k$ 
and $\Sigma_k$. We see that the result is 

$$
\begin{aligned}
\ell(\mu_k, \Sigma_k) &= \sum_i r_{ik} \log p(x_i|\theta_k) && (11.29) \\
&= -\frac{1}{2} \sum_i r_{ik} \left[ \log |\Sigma_k| + (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) \right] && (11.30)
\end{aligned}
$$

This is just a weighted version of the standard problem of computing the MLEs of an MVN (see 
Section 4.1.3). One can show (Exercise 11.2) that the new parameter estimates are given by 

$$\mu_k = \frac{\sum_i r_{ik} x_i}{r_k} \tag{11.31}$$

$$\Sigma_k = \frac{\sum_i r_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{r_k} = \frac{\sum_i r_{ik} x_i x_i^T}{r_k} - \mu_k \mu_k^T \tag{11.32}$$

These equations make intuitive sense: the mean of cluster $k$ is just the weighted average of all 
points assigned to cluster $k$, and the covariance is proportional to the weighted empirical scatter 
matrix. 

After computing the new estimates, we set $\theta^t = (\pi_k, \mu_k, \Sigma_k)$ for $k = 1:K$, and go to the 
next E step. 
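
Putting the E and M steps together gives the full loop. The sketch below is a 1d illustration under stated assumptions (scalar variances instead of covariance matrices, a fixed number of iterations, and hypothetical names such as `em_gmm_1d`); it is not the code used to produce the book's figures. It also records the observed-data log likelihood at each iteration so the monotonicity property mentioned earlier can be checked.

```python
import math

def normal_pdf(x, mu, var):
    """Density of N(x | mu, var) in one dimension."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, pis, mus, variances, n_iter=50):
    """EM for a 1d GMM; returns the parameters and the log-likelihood trace."""
    pis, mus, variances = list(pis), list(mus), list(variances)
    K, N = len(pis), len(xs)
    trace = []
    for _ in range(n_iter):
        # E step: responsibilities r[i][k] (Equation 11.27).
        resp, ll = [], 0.0
        for x in xs:
            w = [pis[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(w)
            ll += math.log(total)
            resp.append([wk / total for wk in w])
        trace.append(ll)  # log likelihood of the parameters used in this E step
        # M step: weighted MLEs (Equations 11.28, 11.31 and 11.32).
        for k in range(K):
            r_k = sum(r[k] for r in resp)
            pis[k] = r_k / N
            mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / r_k
            variances[k] = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / r_k
    return pis, mus, variances, trace

# Toy data with two obvious groups (an assumption for illustration).
xs = [-2.3, -1.9, -2.1, -1.6, -2.4, 1.8, 2.2, 2.5, 1.7, 2.0]
pis, mus, variances, trace = em_gmm_1d(xs, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
# The observed-data log likelihood should never decrease between iterations.
monotone = all(b >= a - 1e-9 for a, b in zip(trace, trace[1:]))
print(pis, mus, variances, monotone)
```

This plain ML version has no protection against a component collapsing onto a single point; the MAP variant discussed later in the chapter addresses that.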


Example 


An example of the algorithm in action is shown in Figure 11.11. We start with $\mu_1 = (-1, 1)$, 
$\Sigma_1 = I$, $\mu_2 = (1, -1)$, $\Sigma_2 = I$. We color code points such that blue points come from cluster 
1 and red points from cluster 2. More precisely, we set the color to 


$$\mathrm{color}(i) = r_{i1} \, \mathrm{blue} + r_{i2} \, \mathrm{red} \tag{11.33}$$


so ambiguous points appear purple. After 20 iterations, the algorithm has converged on a good 
clustering. (The data was standardized, by removing the mean and dividing by the standard 
deviation, before processing. This often helps convergence.) 


K-means algorithm 


There is a popular variant of the EM algorithm for GMMs known as the K-means algorithm, 
which we now discuss. Consider a GMM in which we make the following assumptions: $\Sigma_k = \sigma^2 I_D$ 
is fixed, and $\pi_k = 1/K$ is fixed, so only the cluster centers, $\mu_k \in \mathbb{R}^D$, have to be 
estimated. Now consider the following delta-function approximation to the posterior computed 
during the E step: 

$$p(z_i = k|x_i, \theta) \approx I(k = z_i^*) \tag{11.34}$$

where $z_i^* = \arg\max_k p(z_i = k|x_i, \theta)$. This is sometimes called hard EM, since we are making 
a hard assignment of points to clusters. Since we assumed an equal spherical covariance matrix 
for each cluster, the most probable cluster for $x_i$ can be computed by finding the nearest 
prototype: 

$$z_i^* = \arg\min_k ||x_i - \mu_k||_2^2 \tag{11.35}$$

Hence in each E step, we must find the Euclidean distance between N data points and K cluster 
centers, which takes $O(NKD)$ time. However, this can be sped up using various techniques, 
such as applying the triangle inequality to avoid some redundant computations (Elkan 2003). 

Given the hard cluster assignments, the M step updates each cluster center by computing the 
mean of all points assigned to it: 

$$\mu_k = \frac{1}{N_k} \sum_{i: z_i = k} x_i \tag{11.36}$$

See Algorithm 11.1 for the pseudo-code. 

Figure 11.11 Illustration of the EM for a GMM applied to the Old Faithful data. (a) Initial (random) values 
of the parameters. (b) Posterior responsibility of each point computed in the first E step. The degree of 
redness indicates the degree to which the point belongs to the red cluster, and similarly for blue; thus 
purple points have a roughly uniform posterior over clusters. (c) We show the updated parameters after 
the first M step. (d) After 3 iterations. (e) After 5 iterations. (f) After 16 iterations. Based on (Bishop 2006a) 
Figure 9.8. Figure generated by mixGaussDemoFaithful. 

Algorithm 11.1: K-means algorithm 

initialize $\mu_k$; 
repeat 
    Assign each data point to its closest cluster center: $z_i = \arg\min_k ||x_i - \mu_k||_2^2$; 
    Update each cluster center by computing the mean of all points assigned to it: $\mu_k = \frac{1}{N_k} \sum_{i: z_i = k} x_i$; 
until converged; 
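
Algorithm 11.1 translates almost line for line into code. The following is a minimal sketch using only the standard library; the names `kmeans` and `sqdist`, the convergence test (assignments stop changing) and the handling of empty clusters (the old center is kept) are assumptions made for this illustration, since the pseudo-code leaves them unspecified.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(xs, centers, max_iter=100):
    """K-means on a list of points (tuples of floats), from given initial centers."""
    centers = [tuple(c) for c in centers]
    assign = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_assign = [
            min(range(len(centers)), key=lambda k: sqdist(x, centers[k]))
            for x in xs
        ]
        if new_assign == assign:      # converged: no assignment changed
            break
        assign = new_assign
        # Update step: each center becomes the mean of its assigned points.
        for k in range(len(centers)):
            members = [x for x, z in zip(xs, assign) if z == k]
            if members:               # assumption: an empty cluster keeps its center
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assign

xs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, assign = kmeans(xs, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers, assign)
```

Each pass costs O(NKD) distance evaluations, matching the count given in the text.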


Figure 11.12 An image compressed using vector quantization with a codebook of size K. (a) K = 2. (b) 
K = 4. Figure generated by vqDemo. 


Vector quantization 


Since K-means is not a proper EM algorithm, it is not maximizing likelihood. Instead, it can be 
interpreted as a greedy algorithm for approximately minimizing a loss function related to data 
compression, as we now explain. 

Suppose we want to perform lossy compression of some real-valued vectors, $x_i \in \mathbb{R}^D$. A very 
simple approach to this is to use vector quantization or VQ. The basic idea is to replace each 
real-valued vector $x_i \in \mathbb{R}^D$ with a discrete symbol $z_i \in \{1, \ldots, K\}$, which is an index into a 
codebook of K prototypes, $\mu_k \in \mathbb{R}^D$. Each data vector is encoded by using the index of the 
most similar prototype, where similarity is measured in terms of Euclidean distance: 

$$\mathrm{encode}(x_i) = \arg\min_k ||x_i - \mu_k||^2 \tag{11.37}$$

We can define a cost function that measures the quality of a codebook by computing the 
reconstruction error or distortion it induces: 

$$J(\mu, z|K, X) \triangleq \frac{1}{N} \sum_{i=1}^{N} ||x_i - \mathrm{decode}(\mathrm{encode}(x_i))||^2 = \frac{1}{N} \sum_{i=1}^{N} ||x_i - \mu_{z_i}||^2 \tag{11.38}$$

where $\mathrm{decode}(k) = \mu_k$. The K-means algorithm can be thought of as a simple iterative scheme 
for minimizing this objective. 

Of course, we can achieve zero distortion if we assign one prototype to every data vector, 
but that takes $O(NDC)$ space, where N is the number of real-valued data vectors, each of 
length D, and C is the number of bits needed to represent a real-valued scalar (the quantization 
accuracy). However, in many data sets, we see similar vectors repeatedly, so rather than storing 
them many times, we can store them once and then create pointers to them. Hence we can 
reduce the space requirement to $O(N \log_2 K + KDC)$: the $O(N \log_2 K)$ term arises because 
each of the N data vectors needs to specify which of the K codewords it is using (the pointers); 
and the $O(KDC)$ term arises because we have to store each codebook entry, each of which is 
a D-dimensional vector. Typically the first term dominates the second, so we can approximate 
the rate of the encoding scheme (number of bits needed per object) as $O(\log_2 K)$, which is 
typically much less than $O(DC)$. 

One application of VQ is to image compression. Consider the N = 200 × 320 = 64,000 pixel 
image in Figure 11.12; this is gray-scale, so D = 1. If we use one byte to represent each pixel 
(a gray-scale intensity of 0 to 255), then C = 8, so we need NC = 512,000 bits to represent 
the image. For the compressed image, we need $N \log_2 K + KC$ bits. For K = 4, this is about 
128kb, a factor of 4 compression. For K = 8, this is about 192kb, a factor of 2.7 compression, 
at negligible perceptual loss (see Figure 11.12(b)). Greater compression could be achieved if we 
modelled spatial correlation between the pixels, e.g., if we encoded 8x8 blocks (as used by JPEG). 
This is because the residual errors (differences from the model's predictions) would be smaller, 
and would take fewer bits to encode. 
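
The storage arithmetic above is easy to reproduce. The sketch below simply evaluates the expression N log2 K + KDC for the image dimensions quoted in the text; the function name `vq_bits` is hypothetical.

```python
import math

def vq_bits(n_vectors, dim, bits_per_scalar, codebook_size):
    """Bits for the pointers plus bits for the codebook: N log2 K + K D C."""
    pointer_bits = n_vectors * math.log2(codebook_size)
    codebook_bits = codebook_size * dim * bits_per_scalar
    return pointer_bits + codebook_bits

N, D, C = 200 * 320, 1, 8      # pixels, values per pixel, bits per value
raw_bits = N * D * C           # uncompressed size in bits
for K in (2, 4, 8):
    compressed = vq_bits(N, D, C, K)
    print(K, int(compressed), round(raw_bits / compressed, 2))
```

The codebook term (K D C bits) is tiny here, which is why the compression factor is governed almost entirely by log2 K.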


Initialization and avoiding local minima 


Both K-means and EM need to be initialized. It is common to pick K data points at random, and 
to make these be the initial cluster centers. Or we can pick the centers sequentially so as to try 
to “cover” the data. That is, we pick the initial point uniformly at random. Then each subsequent 
point is picked from the remaining points with probability proportional to its squared distance 
to the point's closest cluster center. This is known as farthest point clustering (Gonzales 1985), 
or k-means++ (Arthur and Vassilvitskii 2007; Bahmani et al. 2012). Surprisingly, this simple trick 
can be shown to guarantee that the expected distortion is never more than a factor of O(log K) 
worse than optimal (Arthur and Vassilvitskii 2007). 
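
The sequential seeding rule can be sketched in a few lines. This is an illustration only, with a hypothetical function name `kmeanspp_init` and a fixed default seed chosen for reproducibility; it assumes the data contain at least K distinct points, since otherwise all sampling weights become zero and `random.choices` raises an error.

```python
import random

def kmeanspp_init(xs, K, rng=None):
    """Pick K initial centers from xs using squared-distance-weighted sampling."""
    rng = rng or random.Random(0)          # assumption: fixed seed by default
    centers = [rng.choice(xs)]             # first center: uniform at random
    while len(centers) < K:
        # Squared distance from each point to its closest chosen center.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
              for x in xs]
        # Sample the next center with probability proportional to d2.
        # Points already chosen have weight zero, so they are never re-drawn.
        centers.append(rng.choices(xs, weights=d2, k=1)[0])
    return centers

xs = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
init = kmeanspp_init(xs, K=2)
print(init)
```

The returned centers can be passed directly to a K-means or EM routine as its starting point.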

A heuristic that is commonly used in the speech recognition community is to incrementally 
“grow” GMMs: we initially give each cluster a score based on its mixture weight; after each 
round of training, we consider splitting the cluster with the highest score into two, with the new 
centroids being random perturbations of the original centroid, and the new scores being half of 
the old scores. If a new cluster has too small a score, or too narrow a variance, it is removed. 
We continue in this way until the desired number of clusters is reached. See (Figueiredo and 
Jain 2002) for a similar incremental approach. 


MAP estimation 


As usual, the MLE may overfit. The overfitting problem is particularly severe in the case of 
GMMs. To understand the problem, suppose for simplicity that $\Sigma_k = \sigma_k^2 I$, and that $K = 2$. It 
is possible to get an infinite likelihood by assigning one of the centers, say $\mu_2$, to a single data 
point, say $x_1$, since then the 1st term makes the following contribution to the likelihood: 

$$\mathcal{N}(x_1|\mu_2, \sigma_2^2 I) = \frac{1}{\sqrt{2\pi\sigma_2^2}} e^0 \tag{11.39}$$

Figure 11.13 (a) Illustration of how singularities can arise in the likelihood function of GMMs. Based on 
(Bishop 2006a) Figure 9.7. Figure generated by mixGaussSingularity. (b) Illustration of the benefit of 
MAP estimation vs ML estimation when fitting a Gaussian mixture model. We plot the fraction of times 
(out of 5 random trials) each method encounters numerical problems vs the dimensionality of the problem, 
for N = 100 samples. Solid red (upper curve): MLE. Dotted black (lower curve): MAP. Figure generated by 
mixGaussMLvsMAP. 

Hence we can drive this term to infinity by letting $\sigma_2 \to 0$, as shown in Figure 11.13(a). We will 
call this the “collapsing variance problem”. 
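
The collapse is easy to observe numerically. In the sketch below, which uses made-up data and a hypothetical helper `log_lik`, one component is kept broad so that every point has non-zero density, while the second component is centered exactly on one data point and its standard deviation is shrunk.

```python
import math

def log_lik(data, mu1, sigma1, mu2, sigma2):
    """Log likelihood of an equal-weight two-component 1d GMM."""
    def pdf(x, mu, s):
        return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return sum(math.log(0.5 * pdf(x, mu1, sigma1) + 0.5 * pdf(x, mu2, sigma2))
               for x in data)

data = [-1.2, -0.4, 0.1, 0.7, 1.5, 3.0]    # toy data (an assumption)
# Component 1 covers all the data; component 2 sits exactly on the point x = 3.0.
lls = [log_lik(data, 0.0, 1.5, 3.0, s2) for s2 in (1.0, 1e-2, 1e-4, 1e-8)]
print(lls)
```

The log likelihood keeps growing as the second standard deviation shrinks, so an unregularized optimizer has no reason to stop.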

An easy solution to this is to perform MAP estimation. The new auxiliary function is the 
expected complete data log-likelihood plus the log prior: 

$$Q'(\theta, \theta^{old}) = \left[ \sum_i \sum_k r_{ik} \log \pi_k + \sum_i \sum_k r_{ik} \log p(x_i|\theta_k) \right] + \log p(\pi) + \sum_k \log p(\theta_k) \tag{11.40}$$

Note that the E step remains unchanged, but the M step needs to be modified, as we now 
explain. 

For the prior on the mixture weights, it is natural to use a Dirichlet prior, $\pi \sim \mathrm{Dir}(\alpha)$, since 
this is conjugate to the categorical distribution. The MAP estimate is given by 

$$\pi_k = \frac{r_k + \alpha_k - 1}{N + \sum_k \alpha_k - K} \tag{11.41}$$

If we use a uniform prior, $\alpha_k = 1$, this reduces to Equation 11.28. 

The prior on the parameters of the class conditional densities, $p(\theta_k)$, depends on the form of 
the class conditional densities. We discuss the case of GMMs below, and leave MAP estimation 
for mixtures of Bernoullis to Exercise 11.3. 

For simplicity, let us consider a conjugate prior of the form 

$$p(\mu_k, \Sigma_k) = \mathrm{NIW}(\mu_k, \Sigma_k|m_0, \kappa_0, \nu_0, S_0) \tag{11.42}$$

From Section 4.6.3, the MAP estimate is given by 

$$\hat{\mu}_k = \frac{r_k \bar{x}_k + \kappa_0 m_0}{r_k + \kappa_0} \tag{11.43}$$

$$\bar{x}_k \triangleq \frac{\sum_i r_{ik} x_i}{r_k} \tag{11.45}$$

$$\hat{\Sigma}_k = \frac{S_0 + S_k + \frac{\kappa_0 r_k}{\kappa_0 + r_k} (\bar{x}_k - m_0)(\bar{x}_k - m_0)^T}{\nu_0 + r_k + D + 2} \tag{11.46}$$

$$S_k \triangleq \sum_i r_{ik} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^T \tag{11.47}$$

We now illustrate the benefits of using MAP estimation instead of ML estimation in the 
context of GMMs. We apply EM to some synthetic data in D dimensions, using either ML or 
MAP estimation. We count the trial as a “failure” if there are numerical issues involving singular 
matrices. For each dimensionality, we conduct 5 random trials. The results are illustrated in 
Figure 11.13(b) using N = 100. We see that as soon as D becomes even moderately large, ML 
estimation crashes and burns, whereas MAP estimation never encounters numerical problems. 

When using MAP estimation, we need to specify the hyper-parameters. Here we mention 
some simple heuristics for setting them (Fraley and Raftery 2007, p163). We can set $\kappa_0 = 0$, 
so that the $\mu_k$ are unregularized, since the numerical problems only arise from $\Sigma_k$. In this 
case, the MAP estimates simplify to $\hat{\mu}_k = \bar{x}_k$ and $\hat{\Sigma}_k = \frac{S_0 + S_k}{\nu_0 + r_k + D + 2}$, which is not quite so 
scary-looking. 

Now we discuss how to set $S_0$. One possibility is to use 

$$S_0 = \frac{1}{K^{1/D}} \mathrm{diag}(s_1^2, \ldots, s_D^2) \tag{11.48}$$

where $s_j^2 = (1/N) \sum_{i=1}^{N} (x_{ij} - \bar{x}_j)^2$ is the pooled variance for dimension $j$. (The reason 
for the $\frac{1}{K^{1/D}}$ term is that the resulting volume of each ellipsoid is then given by 
$|S_0| = \frac{1}{K} |\mathrm{diag}(s_1^2, \ldots, s_D^2)|$.) The parameter $\nu_0$ controls how strongly we believe this prior. The 
weakest prior we can use, while still being proper, is to set $\nu_0 = D + 2$, so this is a common 
choice. 
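
To see how these estimates prevent the variance from collapsing, here is a minimal 1d sketch of the modified M step with $\kappa_0 = 0$. The function name `map_m_step_1d`, the symmetric Dirichlet parameter `alpha` and the default values of `S0` and `nu0` are assumptions chosen for illustration (with D = 1, the choice `nu0 = 3` corresponds to $\nu_0 = D + 2$).

```python
def map_m_step_1d(xs, resp, alpha=2.0, S0=0.1, nu0=3.0):
    """MAP M step for a 1d GMM with kappa_0 = 0 (means left unregularized)."""
    N, K, D = len(xs), len(resp[0]), 1
    pis, mus, variances = [], [], []
    for k in range(K):
        r_k = sum(r[k] for r in resp)          # assumes r_k > 0
        # Mixing weight: Equation 11.41 with alpha_k = alpha for all k.
        pis.append((r_k + alpha - 1) / (N + K * alpha - K))
        # Weighted mean and weighted scatter: Equations 11.45 and 11.47.
        xbar = sum(r[k] * x for r, x in zip(resp, xs)) / r_k
        S_k = sum(r[k] * (x - xbar) ** 2 for r, x in zip(resp, xs))
        mus.append(xbar)
        # Regularized variance: the prior scatter S0 keeps this above zero.
        variances.append((S0 + S_k) / (nu0 + r_k + D + 2))
    return pis, mus, variances

# Cluster 0 owns a single point, so its ML variance would be exactly zero.
xs = [0.0, 5.0, 5.0]
resp = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pis, mus, variances = map_m_step_1d(xs, resp)
print(pis, mus, variances)
```

Even with a cluster that contains one point, the estimated variance stays strictly positive because of the $S_0$ term in the numerator.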


EM for mixture of experts 


We can fit a mixture of experts model using EM in a straightforward manner. The expected 
complete data log likelihood is given by 

$$Q(\theta, \theta^{old}) = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \log[\pi_{ik} \mathcal{N}(y_i|w_k^T x_i, \sigma_k^2)] \tag{11.49}$$

$$\pi_{ik} \triangleq \mathcal{S}(V^T x_i)_k \tag{11.50}$$

$$r_{ik} \propto \pi_{ik}^{old} \mathcal{N}(y_i|x_i^T w_k^{old}, (\sigma_k^{old})^2) \tag{11.51}$$

So the E step is the same as in a standard mixture model, except we have to replace $\pi_k$ with 
$\pi_{ik}$ when computing $r_{ik}$. 

In the M step, we need to maximize $Q(\theta, \theta^{old})$ wrt $w_k$, $\sigma_k^2$ and $V$. For the regression 
parameters for model $k$, the objective has the form 

$$Q(\theta_k, \theta^{old}) = \sum_{i=1}^{N} r_{ik} \left\{ -\frac{1}{\sigma_k^2} (y_i - w_k^T x_i)^2 \right\} \tag{11.52}$$

We recognize this as a weighted least squares problem, which makes intuitive sense: if $r_{ik}$ is 
small, then data point $i$ will be downweighted when estimating model $k$'s parameters. From 
Section 8.3.4 we can immediately write down the MLE as 

$$w_k = (X^T R_k X)^{-1} X^T R_k y \tag{11.53}$$

where $R_k = \mathrm{diag}(r_{:,k})$. The MLE for the variance is given by 

$$\sigma_k^2 = \frac{\sum_{i=1}^{N} r_{ik} (y_i - w_k^T x_i)^2}{\sum_{i=1}^{N} r_{ik}} \tag{11.54}$$

We replace the estimate of the unconditional mixing weights $\pi$ with the estimate of the gating 
parameters, $V$. The objective has the form 

$$\ell(V) = \sum_i \sum_k r_{ik} \log \pi_{i,k} \tag{11.55}$$

We recognize this as equivalent to the log-likelihood for multinomial logistic regression in 
Equation 8.34, except we replace the “hard” 1-of-C encoding $y_i$ with the “soft” 1-of-K encoding 
$r_i$. Thus we can estimate $V$ by fitting a logistic regression model to soft target labels. 
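
The expert update is ordinary weighted least squares. The sketch below handles the simplest case, a scalar input $u_i$ with features $x_i = (1, u_i)$, so that the 2x2 normal equations can be solved by hand; the function name `weighted_least_squares` and the toy data are assumptions. Fitting the gating parameters $V$ would additionally need an iterative logistic regression solver, which is not shown.

```python
def weighted_least_squares(us, ys, weights):
    """Solve w = (X^T R X)^{-1} X^T R y for features x_i = (1, u_i)."""
    s = sum(weights)
    su = sum(r * u for r, u in zip(weights, us))
    suu = sum(r * u * u for r, u in zip(weights, us))
    sy = sum(r * y for r, y in zip(weights, ys))
    suy = sum(r * u * y for r, u, y in zip(weights, us, ys))
    det = s * suu - su * su             # determinant of the 2x2 matrix X^T R X
    w0 = (suu * sy - su * suy) / det    # intercept
    w1 = (s * suy - su * sy) / det      # slope
    # Weighted residual variance, as in Equation 11.54.
    var = sum(r * (y - w0 - w1 * u) ** 2 for r, u, y in zip(weights, us, ys)) / s
    return (w0, w1), var

# The last point is an outlier for this expert, so it gets responsibility 0.
us = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 100.0]
(w0, w1), var = weighted_least_squares(us, ys, weights=[1.0, 1.0, 1.0, 0.0])
print(w0, w1, var)
```

With zero responsibility the fourth point has no influence, so the expert recovers the line through the first three points exactly.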


EM for DGMs with hidden variables 


We can generalize the ideas behind EM for mixtures of experts to compute the MLE or MAP 
estimate for an arbitrary DGM. We could use gradient-based methods (Binder et al. 1997), but it 
is much simpler to use EM (Lauritzen 1995): in the E step, we just estimate the hidden variables, 
and in the M step, we will compute the MLE using these filled-in values. We give the details 
below. 

For simplicity of presentation, we will assume all CPDs are tabular. Based on Section 10.4.2, 
let us write each CPT as follows: 

$$p(x_{it}|x_{i,\mathrm{pa}(t)}, \theta_t) = \prod_{c=1}^{K_{\mathrm{pa}(t)}} \prod_{k=1}^{K_t} \theta_{tck}^{I(x_{it} = k, \, x_{i,\mathrm{pa}(t)} = c)} \tag{11.56}$$

The log-likelihood of the complete data is given by 

$$\log p(D|\theta) = \sum_{t=1}^{V} \sum_{c=1}^{K_{\mathrm{pa}(t)}} \sum_{k=1}^{K_t} N_{tck} \log \theta_{tck} \tag{11.57}$$

where $N_{tck} = \sum_{i=1}^{N} I(x_{it} = k, x_{i,\mathrm{pa}(t)} = c)$ are the empirical counts. Hence the expected 
complete data log-likelihood has the form 

$$E[\log p(D|\theta)] = \sum_t \sum_c \sum_k \bar{N}_{tck} \log \theta_{tck} \tag{11.58}$$

where 

$$\bar{N}_{tck} = \sum_{i=1}^{N} E[I(x_{it} = k, x_{i,\mathrm{pa}(t)} = c)] = \sum_i p(x_{it} = k, x_{i,\mathrm{pa}(t)} = c|D_i) \tag{11.59}$$

where $D_i$ are all the visible variables in case $i$. 

The quantity $p(x_{it}, x_{i,\mathrm{pa}(t)}|D_i, \theta)$ is known as a family marginal, and can be computed 
using any GM inference algorithm. The $\bar{N}_{tck}$ are the expected sufficient statistics, and constitute 
the output of the E step. 

Given these ESS, the M step has the simple form 

$$\hat{\theta}_{tck} = \frac{\bar{N}_{tck}}{\sum_{k'} \bar{N}_{tck'}} \tag{11.60}$$

This can be proved by adding Lagrange multipliers (to enforce the constraint $\sum_k \theta_{tck} = 1$) 
to the expected complete data log likelihood, and then optimizing each parameter vector $\theta_{tc}$ 
separately. We can modify this to perform MAP estimation with a Dirichlet prior by simply 
adding pseudo counts to the expected counts. 
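
The M step for one node is therefore just a row-wise normalization of the expected counts. The sketch below assumes the counts for a single node are stored as a nested list indexed by parent configuration and child state; the function name `m_step_cpt` and the single shared `pseudo_count` (a symmetric prior) are assumptions for illustration. Computing the expected counts themselves requires an inference routine, which is not shown.

```python
def m_step_cpt(expected_counts, pseudo_count=0.0):
    """Normalize expected counts N[c][k] into a CPT theta[c][k].

    expected_counts[c][k] is the expected number of cases with parent
    configuration c and child state k. A positive pseudo_count gives a
    MAP-style estimate under a symmetric Dirichlet prior.
    """
    cpt = []
    for row in expected_counts:
        smoothed = [n + pseudo_count for n in row]
        total = sum(smoothed)           # must be positive for the row to normalize
        cpt.append([n / total for n in smoothed])
    return cpt

# Second parent configuration was never (softly) observed in this toy example.
counts = [[3.2, 0.8], [0.0, 0.0]]
mle_first_row = m_step_cpt(counts[:1])             # plain normalization
smoothed = m_step_cpt(counts, pseudo_count=1.0)    # with pseudo counts
print(mle_first_row, smoothed)
```

The pseudo counts also take care of parent configurations with zero expected count, for which the unsmoothed ratio would be undefined.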


EM for the Student distribution * 


One problem with the Gaussian distribution is that it is sensitive to outliers, since the 
log-probability only decays quadratically with distance from the center. A more robust alternative is 
the Student t distribution, as discussed in Section ??. 

Unlike the case of a Gaussian, there is no closed form formula for the MLE of a Student, even 
if we have no missing data, so we must resort to iterative optimization methods. The easiest 
one to use is EM, since it automatically enforces the constraints that $\nu$ is positive and that $\Sigma$ 
is symmetric positive definite. In addition, the resulting algorithm turns out to have a simple 
intuitive form, as we see below. 

At first blush, it might not be apparent why EM can be used, since there is no missing data. 
The key idea is to introduce an “artificial” hidden or auxiliary variable in order to simplify the 
algorithm. In particular, we will exploit the fact that a Student distribution can be written as a 
Gaussian scale mixture: 

$$\mathcal{T}(x_i|\mu, \Sigma, \nu) = \int \mathcal{N}(x_i|\mu, \Sigma/z_i) \, \mathrm{Ga}\left(z_i \,\Big|\, \frac{\nu}{2}, \frac{\nu}{2}\right) dz_i \tag{11.61}$$

(See Exercise 11.1 for a proof of this in the 1d case.) This can be thought of as an “infinite” 
mixture of Gaussians, each one with a slightly different covariance matrix. 

Treating the $z_i$ as missing data, we can write the complete data log likelihood as 

$$
\begin{aligned}
\ell_c(\theta) &= \sum_{i=1}^{N} \left[ \log \mathcal{N}(x_i|\mu, \Sigma/z_i) + \log \mathrm{Ga}(z_i|\nu/2, \nu/2) \right] && (11.62) \\
&= \sum_{i=1}^{N} \Big[ -\frac{D}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma| - \frac{z_i}{2} \delta_i + \frac{\nu}{2} \log \frac{\nu}{2} - \log \Gamma\left(\frac{\nu}{2}\right) && (11.63) \\
&\qquad\quad + \frac{\nu}{2} (\log z_i - z_i) + \left( \frac{D}{2} - 1 \right) \log z_i \Big] && (11.64)
\end{aligned}
$$

where we have defined the Mahalanobis distance to be 

$$\delta_i = (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \tag{11.65}$$

We can partition this into two terms, one involving $\mu$ and $\Sigma$, and the other involving $\nu$. We 
have, dropping irrelevant constants, 

$$
\begin{aligned}
\ell_c(\theta) &= L_N(\mu, \Sigma) + L_G(\nu) && (11.66) \\
L_N(\mu, \Sigma) &\triangleq -\frac{1}{2} N \log |\Sigma| - \frac{1}{2} \sum_{i=1}^{N} z_i \delta_i && (11.67) \\
L_G(\nu) &\triangleq -N \log \Gamma(\nu/2) + \frac{1}{2} N \nu \log(\nu/2) + \frac{1}{2} \nu \sum_{i=1}^{N} (\log z_i - z_i) && (11.68)
\end{aligned}
$$


EM with v known 


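Let us first derive the algorithm with $\nu$ assumed known, for simplicity. In this case, we can 
ignore the $L_G$ term, so we only need to figure out how to compute $E[z_i]$ wrt the old parameters. 
From Section 4.6.2.2 we have 

$$p(z_i|x_i, \theta) = \mathrm{Ga}\left(z_i \,\Big|\, \frac{\nu + D}{2}, \frac{\nu + \delta_i}{2}\right) \tag{11.69}$$

Now if $z_i \sim \mathrm{Ga}(a, b)$, then $E[z_i] = a/b$. Hence the E step at iteration $t$ is 

$$\bar{z}_i^{(t)} \triangleq E\left[ z_i|x_i, \theta^{(t)} \right] = \frac{\nu^{(t)} + D}{\nu^{(t)} + \delta_i^{(t)}} \tag{11.70}$$

The M step is obtained by maximizing $E[L_N(\mu, \Sigma)]$ to yield 

$$\hat{\mu}^{(t+1)} = \frac{\sum_i \bar{z}_i^{(t)} x_i}{\sum_i \bar{z}_i^{(t)}} \tag{11.71}$$

$$
\begin{aligned}
\hat{\Sigma}^{(t+1)} &= \frac{1}{N} \sum_i \bar{z}_i^{(t)} (x_i - \hat{\mu}^{(t+1)})(x_i - \hat{\mu}^{(t+1)})^T && (11.72) \\
&= \frac{1}{N} \left[ \sum_{i=1}^{N} \bar{z}_i^{(t)} x_i x_i^T - \left( \sum_{i=1}^{N} \bar{z}_i^{(t)} \right) \hat{\mu}^{(t+1)} (\hat{\mu}^{(t+1)})^T \right] && (11.73)
\end{aligned}
$$

These results are quite intuitive: the quantity $\bar{z}_i$ is the precision of measurement $i$, so if it is 
small, the corresponding data point is down-weighted when estimating the mean and covariance. 
This is how the Student achieves robustness to outliers. 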
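
The down-weighting effect is easy to see in one dimension. The following is a minimal sketch of the updates with $\nu$ fixed, using D = 1 so that the covariance is a scalar; the function name `student_em_1d`, the initialization from the sample mean and variance, the fixed iteration count and the toy data are assumptions for illustration.

```python
def student_em_1d(xs, nu, n_iter=100):
    """EM for the location and scale of a 1d Student t with known dof nu."""
    N, D = len(xs), 1
    mu = sum(xs) / N                                   # start from the sample mean
    sigma2 = sum((x - mu) ** 2 for x in xs) / N        # and the sample variance
    for _ in range(n_iter):
        # E step: expected precision weights (Equation 11.70).
        delta = [(x - mu) ** 2 / sigma2 for x in xs]   # 1d Mahalanobis distances
        z = [(nu + D) / (nu + d) for d in delta]
        # M step: weighted mean and scale (Equations 11.71 and 11.72).
        mu = sum(zi * x for zi, x in zip(z, xs)) / sum(z)
        sigma2 = sum(zi * (x - mu) ** 2 for zi, x in zip(z, xs)) / N
    return mu, sigma2, z

xs = [-0.3, -0.1, 0.0, 0.1, 0.2, 0.1, 25.0]   # six inliers and one gross outlier
sample_mean = sum(xs) / len(xs)
mu, sigma2, z = student_em_1d(xs, nu=3.0)
print(sample_mean, mu, sigma2, z[-1])
```

The outlier ends up with a weight close to zero, so the estimated location stays near the bulk of the data, whereas the plain sample mean is pulled well away from it.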


EM with ν unknown


To compute the MLE for the degrees of freedom, we first need to compute the expectation of $L_G(\nu)$, which involves $z_i$ and $\log z_i$. Now if $z_i \sim \mathrm{Ga}(a, b)$, then one can show that

$$\bar\ell_i^{(t)} \triangleq E[\log z_i|\theta^{(t)}] = \Psi(a) - \log b \tag{11.74}$$

where $\Psi(x) = \frac{d}{dx}\log\Gamma(x)$ is the digamma function. Hence, from Equation 11.69, we have

$$\bar\ell_i^{(t)} = \Psi\left(\frac{\nu^{(t)} + D}{2}\right) - \log\left(\frac{\nu^{(t)} + \delta_i^{(t)}}{2}\right) \tag{11.75}$$

$$= \log(\bar{z}_i^{(t)}) + \Psi\left(\frac{\nu^{(t)} + D}{2}\right) - \log\left(\frac{\nu^{(t)} + D}{2}\right) \tag{11.76}$$

Substituting into Equation 11.68, we have

$$E[L_G(\nu)] = -N\log\Gamma(\nu/2) + \frac{N\nu}{2}\log(\nu/2) + \frac{\nu}{2}\sum_i (\bar\ell_i^{(t)} - \bar{z}_i^{(t)}) \tag{11.77}$$

The gradient of this expression is equal to

$$\frac{d}{d\nu}E[L_G(\nu)] = -\frac{N}{2}\Psi(\nu/2) + \frac{N}{2}\log(\nu/2) + \frac{N}{2} + \frac{1}{2}\sum_i (\bar\ell_i^{(t)} - \bar{z}_i^{(t)}) \tag{11.78}$$

Figure 11.14 Mixture modeling on the bankruptcy data set. Left: Gaussian class conditional densities. Right: Student class conditional densities. Points that belong to class 1 are shown as triangles, points that belong to class 2 are shown as circles. The estimated labels, based on the posterior probability of belonging to each mixture component, are computed. If these are incorrect, the point is colored red, otherwise it is colored blue. (Training data is in black.) Figure generated by mixStudentBankruptcyDemo.


This has a unique solution in the interval $(0, +\infty]$ which can be found using a 1d constrained optimizer.

Performing a gradient-based optimization in the M step, rather than a closed-form update, is 
an example of what is known as the generalized EM algorithm. One can show that EM will still 
converge to a local optimum even if we only perform a “partial” improvement to the parameters 
in the M step. 
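As a rough sketch of the update for $\nu$, the code below maximizes the expected $L_G(\nu)$ of Equation 11.77 directly. Note the swap: instead of the gradient-based optimizer mentioned in the text, it uses a derivative-free ternary search, which is valid here because the objective is concave in $\nu$. The function names, the search interval and the example values of $\bar z_i$ and $\bar\ell_i$ are assumptions made for illustration.

```python
import math

def expected_LG(nu, zbar, lbar):
    """Eq. 11.77: expected L_G(nu) given zbar_i = E[z_i] and lbar_i = E[log z_i]."""
    n = len(zbar)
    s = sum(l - z for l, z in zip(lbar, zbar))
    return (-n * math.lgamma(nu / 2.0)
            + 0.5 * n * nu * math.log(nu / 2.0)
            + 0.5 * nu * s)

def update_nu(zbar, lbar, lo=1e-2, hi=1e3, n_iter=200):
    """Maximize Eq. 11.77 over nu in [lo, hi] by ternary search (concave objective)."""
    for _ in range(n_iter):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if expected_LG(m1, zbar, lbar) < expected_LG(m2, zbar, lbar):
            lo = m1   # the maximum lies to the right of m1
        else:
            hi = m2   # the maximum lies to the left of m2
    return 0.5 * (lo + hi)

# Hypothetical E-step outputs; lbar_i < log(zbar_i), as Jensen's inequality requires.
zbar = [1.2, 0.8, 0.3, 1.5]
lbar = [math.log(z) - 0.2 for z in zbar]
nu_new = update_nu(zbar, lbar)
```

If the maximizer lies beyond `hi`, the search simply returns a value near `hi`, which corresponds to the Gaussian limit of large $\nu$.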


Mixtures of Student distributions 


It is easy to extend the above methods to fit a mixture of Student distributions. See Exercise 11.4 
for the details. 

Let us consider a small example from (Lo 2009, ch3). We have an $N = 66$, $D = 2$ data set regarding the bankruptcy patterns of certain companies. The first feature specifies the ratio of retained earnings (RE) to total assets, and the second feature specifies the ratio of earnings before interest and taxes (EBIT) to total assets. We fit two models to this data, ignoring the class labels: a mixture of 2 Gaussians, and a mixture of 2 Students. We then use each fitted model to classify the data. We compute the most probable cluster membership and treat this as $\hat{y}_i$. We then compare $\hat{y}_i$ to the true labels $y_i$ and compute an error rate. If this is more than 50%, we permute the latent labels (i.e., we consider cluster 1 to represent class 2 and vice versa), and then recompute the error rate. Points which are misclassified are then shown in red. The result is shown in Figure 11.14. We see that the Student model made 4 errors, whereas the Gaussian model made 21. This is because the class-conditional densities contain some extreme values, causing the Gaussian to be a poor choice.


EM for probit regression * 


In Section 9.4.2, we described the latent variable interpretation of probit regression. Recall that this has the form $p(y_i = 1|z_i) = \mathbb{I}(z_i > 0)$, where $z_i \sim \mathcal{N}(w^T x_i, 1)$ is latent. We now show how to fit this model using EM. (Although it is possible to fit probit regression models using gradient based methods, as shown in Section 9.4.1, this EM-based approach has the advantage that it generalizes to many other kinds of models, as we will see later on.)

The complete data log likelihood has the following form, assuming a $\mathcal{N}(0, V_0)$ prior on $w$:

$$\ell(z, w|V_0) = \log p(y|z) + \log\mathcal{N}(z|Xw, I) + \log\mathcal{N}(w|0, V_0) \tag{11.79}$$

$$= \sum_i \log p(y_i|z_i) - \frac{1}{2}(z - Xw)^T(z - Xw) - \frac{1}{2}w^T V_0^{-1} w + \mathrm{const} \tag{11.80}$$


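The posterior in the E step is a truncated Gaussian:

$$p(z_i|y_i, x_i, w) = \begin{cases} \mathcal{N}(z_i|w^T x_i, 1)\,\mathbb{I}(z_i > 0) & \text{if } y_i = 1 \\ \mathcal{N}(z_i|w^T x_i, 1)\,\mathbb{I}(z_i < 0) & \text{if } y_i = 0 \end{cases} \tag{11.81}$$

In Equation 11.80, we see that $w$ only depends linearly on $z$, so we just need to compute $E[z_i|y_i, x_i, w]$. Exercise 11.15 asks you to show that the posterior mean is given by

$$E[z_i|w, x_i] = \begin{cases} \mu_i + \dfrac{\phi(\mu_i)}{1 - \Phi(-\mu_i)} = \mu_i + \dfrac{\phi(\mu_i)}{\Phi(\mu_i)} & \text{if } y_i = 1 \\[2ex] \mu_i - \dfrac{\phi(\mu_i)}{\Phi(-\mu_i)} = \mu_i - \dfrac{\phi(\mu_i)}{1 - \Phi(\mu_i)} & \text{if } y_i = 0 \end{cases} \tag{11.82}$$

where $\mu_i = w^T x_i$.
In the M step, we estimate $w$ using ridge regression, where $\mu = E[z]$ is the output we are trying to predict. Specifically, we have

$$\hat{w} = (V_0^{-1} + X^T X)^{-1} X^T \mu \tag{11.83}$$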


The EM algorithm is simple, but can be much slower than direct gradient methods, as 
illustrated in Figure 11.15. This is because the posterior entropy in the E step is quite high, since 
we only observe that z is positive or negative, but are given no information from the likelihood 
about its magnitude. Using a stronger regularizer can help speed convergence, because it 
constrains the range of plausible z values. In addition, one can use various speedup tricks, such 
as data augmentation (van Dyk and Meng 2001), but we do not discuss that here. 
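The two steps above can be sketched for a single scalar feature with no offset term, so that $w$ is a scalar and the ridge regression of Equation 11.83 reduces to a ratio of sums. This is an illustrative sketch under assumptions: the names `probit_em`, `phi` and `Phi`, the prior variance `v0` and the synthetic data are hypothetical, and the normal cdf is computed from `math.erfc` for accuracy in the tails.

```python
import math
import random

def phi(u):
    """Standard normal pdf."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def Phi(u):
    """Standard normal cdf, written with erfc so small tail values stay accurate."""
    return 0.5 * math.erfc(-u / math.sqrt(2.0))

def probit_em(x, y, v0=100.0, n_iter=200):
    """EM for scalar probit regression with a N(0, v0) prior on w."""
    w = 0.0
    sxx = sum(xi * xi for xi in x)
    for _ in range(n_iter):
        # E step (Eq. 11.82): posterior mean of each truncated Gaussian z_i
        ez = []
        for xi, yi in zip(x, y):
            m = w * xi
            if yi == 1:
                ez.append(m + phi(m) / Phi(m))
            else:
                ez.append(m - phi(m) / Phi(-m))
        # M step (Eq. 11.83), scalar case: ridge regression of E[z] on x
        w = sum(xi * zi for xi, zi in zip(x, ez)) / (1.0 / v0 + sxx)
    return w

# Hypothetical data generated from the latent variable model with w = 1.5.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(500)]
ys = [1 if 1.5 * xi + rng.gauss(0.0, 1.0) > 0 else 0 for xi in xs]
w_hat = probit_em(xs, ys)
```

Printing `w` inside the loop shows the slow, steady approach to the fixed point that the text attributes to the high posterior entropy of the E step.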
Figure 11.15 Fitting a probit regression model in 2d using a quasi-Newton method or EM. Figure generated by probitRegDemo.


Theoretical basis for EM * 


In this section, we show that EM monotonically increases the observed data log likelihood until 
it reaches a local maximum (or saddle point, although such points are usually unstable). Our 
derivation will also serve as the basis for various generalizations of EM that we will discuss later. 


Expected complete data log likelihood is a lower bound 


Consider an arbitrary distribution $q(z_i)$ over the hidden variables. The observed data log likelihood can be written as follows:

$$\ell(\theta) \triangleq \sum_{i=1}^N \log\left[\sum_{z_i} p(x_i, z_i|\theta)\right] = \sum_{i=1}^N \log\left[\sum_{z_i} q(z_i)\frac{p(x_i, z_i|\theta)}{q(z_i)}\right] \tag{11.84}$$

Now $\log(u)$ is a concave function, so from Jensen's inequality (Equation 2.113) we have the following lower bound:

$$\ell(\theta) \geq \sum_i \sum_{z_i} q_i(z_i)\log\frac{p(x_i, z_i|\theta)}{q_i(z_i)} \tag{11.85}$$

Let us denote this lower bound as follows:

$$Q(\theta, q) \triangleq \sum_i E_{q_i}[\log p(x_i, z_i|\theta)] + H(q_i) \tag{11.86}$$

where $H(q_i)$ is the entropy of $q_i$.
The above argument holds for any positive distribution $q$. Which one should we choose? Intuitively we should pick the $q$ that yields the tightest lower bound. The lower bound is a sum over $i$ of terms of the following form:

$$L(\theta, q_i) = \sum_{z_i} q_i(z_i)\log\frac{p(x_i, z_i|\theta)}{q_i(z_i)} \tag{11.87}$$

$$= \sum_{z_i} q_i(z_i)\log\frac{p(z_i|x_i, \theta)\,p(x_i|\theta)}{q_i(z_i)} \tag{11.88}$$

$$= \sum_{z_i} q_i(z_i)\log\frac{p(z_i|x_i, \theta)}{q_i(z_i)} + \sum_{z_i} q_i(z_i)\log p(x_i|\theta) \tag{11.89}$$

$$= -\mathrm{KL}\left(q_i(z_i)\,\|\,p(z_i|x_i, \theta)\right) + \log p(x_i|\theta) \tag{11.90}$$

The $p(x_i|\theta)$ term is independent of $q_i$, so we can maximize the lower bound by setting $q_i(z_i) = p(z_i|x_i, \theta)$. Of course, $\theta$ is unknown, so instead we use $q_i^t(z_i) = p(z_i|x_i, \theta^t)$, where $\theta^t$ is our estimate of the parameters at iteration $t$. This is the output of the E step.
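The decomposition in Equations 11.87 to 11.90 is easy to check numerically for one data case with a discrete latent variable. The sketch below uses made-up joint probabilities $p(x_i, z_i = k|\theta)$; the helper names `lower_bound` and `kl` are hypothetical.

```python
import math

def lower_bound(joint, q):
    """L(theta, q) = sum_z q(z) log[p(x, z | theta) / q(z)] for one data case."""
    return sum(qz * math.log(pz / qz) for pz, qz in zip(joint, q) if qz > 0)

def kl(q, p):
    """KL(q || p) for two discrete distributions over the same states."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

joint = [0.10, 0.25, 0.05]                  # hypothetical p(x, z=k | theta) for a fixed x
evidence = sum(joint)                       # p(x | theta)
posterior = [p / evidence for p in joint]   # p(z | x, theta)
q_arbitrary = [0.5, 0.3, 0.2]

# Gap between the log evidence and the bound for an arbitrary q
gap = math.log(evidence) - lower_bound(joint, q_arbitrary)
# Bound evaluated at the exact posterior
tight = lower_bound(joint, posterior)
```

Here `gap` equals `kl(q_arbitrary, posterior)` and `tight` equals `math.log(evidence)` up to rounding, which is the statement that the bound touches the log likelihood exactly when $q_i$ is the posterior.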

Plugging this in to the lower bound we get

$$Q(\theta, q^t) = \sum_i E_{q_i^t}[\log p(x_i, z_i|\theta)] + H(q_i^t) \tag{11.91}$$

We recognize the first term as the expected complete data log likelihood. The second term is a constant wrt $\theta$. So the M step becomes

$$\theta^{t+1} = \arg\max_\theta Q(\theta, \theta^t) = \arg\max_\theta \sum_i E_{q_i^t}[\log p(x_i, z_i|\theta)] \tag{11.92}$$

as usual.
Now comes the punchline. Since we used $q_i^t(z_i) = p(z_i|x_i, \theta^t)$, the KL divergence becomes zero, so $L(\theta^t, q_i) = \log p(x_i|\theta^t)$, and hence

$$Q(\theta^t, \theta^t) = \sum_i \log p(x_i|\theta^t) = \ell(\theta^t) \tag{11.93}$$


We see that the lower bound is tight after the E step. Since the lower bound “touches” the 
function, maximizing the lower bound will also “push up” on the function itself. That is, the 
M step is guaranteed to modify the parameters so as to increase the likelihood of the observed 
data (unless it is already at a local maximum). 

This process is sketched in Figure 11.16. The dashed red curve is the original function (the observed data log-likelihood). The solid blue curve is the lower bound, evaluated at $\theta^t$; this touches the objective function at $\theta^t$. We then set $\theta^{t+1}$ to the maximum of the lower bound (blue curve), and fit a new bound at that point (dotted green curve). The maximum of this new bound becomes $\theta^{t+2}$, etc. (Compare this to Newton's method in Figure 8.4(a), which repeatedly fits and then optimizes a quadratic approximation.)


EM monotonically increases the observed data log likelihood 


We now prove that EM monotonically increases the observed data log likelihood until it reaches a local optimum. We have

$$\ell(\theta^{t+1}) \geq Q(\theta^{t+1}, \theta^t) \geq Q(\theta^t, \theta^t) = \ell(\theta^t) \tag{11.94}$$

where the first inequality follows since $Q(\theta, \cdot)$ is a lower bound on $\ell(\theta)$; the second inequality follows since, by definition, $Q(\theta^{t+1}, \theta^t) = \max_\theta Q(\theta, \theta^t) \geq Q(\theta^t, \theta^t)$; and the final equality follows from Equation 11.93.

Figure 11.16 Illustration of EM as a bound optimization algorithm. Based on Figure 9.14 of (Bishop 2006a). Figure generated by emLogLikelihoodMax.

As a consequence of this result, if you do not observe monotonic increase of the observed 
data log likelihood, you must have an error in your math and/or code. (If you are performing 
MAP estimation, you must add on the log prior term to the objective.) This is a surprisingly 
powerful debugging tool. 
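The debugging check described above can be sketched as follows: run EM on a small one-dimensional Gaussian mixture, record the observed data log likelihood at every iteration, and verify that the sequence never decreases. The function names, the synthetic data and the small variance floor are assumptions made for this illustration.

```python
import math
import random

def log_normal(x, m, v):
    """Log density of N(x | m, v) in one dimension."""
    return -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)

def gmm_em_trace(x, means, variances, weights, n_iter=30):
    """Run EM on a 1-D GMM; return the observed data log likelihood per iteration."""
    K = len(means)
    trace = []
    for _ in range(n_iter):
        # E step: responsibilities, plus the log likelihood at the current parameters
        resp, ll = [], 0.0
        for xi in x:
            logp = [math.log(weights[k]) + log_normal(xi, means[k], variances[k])
                    for k in range(K)]
            mx = max(logp)
            lse = mx + math.log(sum(math.exp(l - mx) for l in logp))
            ll += lse
            resp.append([math.exp(l - lse) for l in logp])
        trace.append(ll)
        # M step: weighted MLE for each component
        for k in range(K):
            rk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * xi for r, xi in zip(resp, x)) / rk
            var = sum(r[k] * (xi - means[k]) ** 2 for r, xi in zip(resp, x)) / rk
            variances[k] = max(var, 1e-6)   # guard against a collapsing component
            weights[k] = rk / len(x)
    return trace

rng = random.Random(0)
x = [rng.gauss(-2.0, 1.0) for _ in range(100)] + [rng.gauss(3.0, 1.0) for _ in range(100)]
trace = gmm_em_trace(x, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
monotone = all(b >= a - 1e-8 for a, b in zip(trace, trace[1:]))
```

The tolerance of `1e-8` only absorbs floating point rounding; a decrease larger than that would point to a mistake in the E or M step.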


Online EM 


When dealing with large or streaming datasets, it is important to be able to learn online, as we discussed in Section 8.5. There are two main approaches to online EM in the literature. The first approach, known as incremental EM (Neal and Hinton 1998), optimizes the lower bound $Q(\theta, q_1, \ldots, q_N)$ one $q_i$ at a time; however, this requires storing the expected sufficient statistics for each data case. The second approach, known as stepwise EM (Sato and Ishii 2000; Cappe and Mouline 2009; Cappe 2010), is based on stochastic approximation theory, and only requires constant memory use. We explain both approaches in more detail below, following the presentation of (Liang and Klein).


Batch EM review 


Before explaining online EM, we review batch EM in a more abstract setting. Let $\phi(x, z)$ be a vector of sufficient statistics for a single data case. (For example, for a mixture of multinoullis, this would be the count vector $a(j)$, which is the number of times cluster $j$ was used in $z$, plus the matrix $B(j, v)$, which is the number of times the hidden state was $j$ and the observed letter was $v$.) Let $s_i = \sum_z p(z|x_i, \theta)\phi(x_i, z)$ be the expected sufficient statistics for case $i$, and $\mu = \sum_{i=1}^N s_i$ be the sum of the ESS. Given $\mu$, we can derive an ML or MAP estimate of the parameters in the M step; we will denote this operation by $\theta(\mu)$. (For example, in the case of mixtures of multinoullis, we just need to normalize $a$ and each row of $B$.) With this notation under our belt, the pseudo code for batch EM is as shown in Algorithm 11.2.




Algorithm 11.2: Batch EM algorithm

    initialize $\mu$;
    repeat
        $\mu^{new} = 0$;
        for each example $i = 1 : N$ do
            $s_i := \sum_z p(z|x_i, \theta(\mu))\,\phi(x_i, z)$;
            $\mu^{new} := \mu^{new} + s_i$;
        $\mu := \mu^{new}$;
    until converged;


Incremental EM 


In incremental EM (Neal and Hinton 1998), we keep track of $\mu$ as well as the $s_i$. When we come to a data case, we swap out the old $s_i$ and replace it with the new $s_i^{new}$, as shown in the code in Algorithm 11.3. Note that we can exploit the sparsity of $s_i^{new}$ to speed up the computation of $\theta$, since most components of $\mu$ will not have changed.


Algorithm 11.3: Incremental EM algorithm

    initialize $s_i$ for $i = 1 : N$;
    $\mu = \sum_i s_i$;
    repeat
        for each example $i = 1 : N$ in a random order do
            $s_i^{new} := \sum_z p(z|x_i, \theta(\mu))\,\phi(x_i, z)$;
            $\mu := \mu + s_i^{new} - s_i$;
            $s_i := s_i^{new}$;
    until converged;


This can be viewed as maximizing the lower bound $Q(\theta, q_1, \ldots, q_N)$ by optimizing $q_1$, then $\theta$, then $q_2$, then $\theta$, etc. As such, this method is guaranteed to monotonically converge to a local maximum of the lower bound and to the log likelihood itself.


Stepwise EM 


In stepwise EM, whenever we compute a new $s_i$, we move $\mu$ towards it, as shown in Algorithm 11.4 (see footnote 2). At iteration $k$, the stepsize has value $\eta_k$, which must satisfy the Robbins-Monro conditions in Equation 8.82. For example, (Liang and Klein) use $\eta_k = (2 + k)^{-\kappa}$ for $0.5 < \kappa \leq 1$. We can get somewhat better behavior by using a minibatch of size $m$ before each update. It is possible to optimize $m$ and $\kappa$ to maximize the training set likelihood, by


2. A detail: As written, the update for $\mu$ does not exploit the sparsity of $s_i$. We can fix this by storing $m = \mu / \prod_{j<k}(1 - \eta_j)$ instead of $\mu$, and then using the sparse update $m := m + \frac{\eta_k}{\prod_{j \leq k}(1 - \eta_j)} s_i$. This will not affect the results (i.e., $\theta(\mu) = \theta(m)$), since scaling the counts by a global constant has no effect.


Figure 11.17 Illustration of deterministic annealing. Based on http://en.wikipedia.org/wiki/Graduated_optimization.


trying different values in parallel for an initial trial period; this can significantly speed up the 
algorithm. 


Algorithm 11.4: Stepwise EM algorithm

    initialize $\mu$; $k = 0$;
    repeat
        for each example $i = 1 : N$ in a random order do
            $s_i := \sum_z p(z|x_i, \theta(\mu))\,\phi(x_i, z)$;
            $\mu := (1 - \eta_k)\mu + \eta_k s_i$;
            $k := k + 1$
    until converged;


(Liang and Klein) compare batch EM, incremental EM, and stepwise EM on four different unsupervised language modeling tasks. They found that stepwise EM (using $\kappa \approx 0.7$ and $m \approx 1000$) was faster than incremental EM, and both were much faster than batch EM. In terms of accuracy, stepwise EM was usually as good as or sometimes even better than batch EM; incremental EM was often worse than either of the other methods.
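To make the stepwise update concrete, here is a minimal sketch for a one-dimensional Gaussian mixture, where the sufficient statistics of cluster $k$ are the triple (count, sum of $x$, sum of $x^2$). It uses a minibatch of size 1 and the step size $\eta_k = (k + 2)^{-\kappa}$ quoted above. Everything else (function names, the variance floor, the synthetic data, the number of epochs) is an assumption for illustration, not the setup used by Liang and Klein.

```python
import math
import random

def theta_from_ess(mu):
    """theta(mu): turn ESS [count, sum x, sum x^2] per cluster into (weight, mean, var)."""
    total = sum(m[0] for m in mu)
    params = []
    for c, s1, s2 in mu:
        mean = s1 / c
        var = max(s2 / c - mean * mean, 1e-3)   # floor keeps the variance positive
        params.append((c / total, mean, var))
    return params

def ess_for_case(x, params):
    """s_i: expected sufficient statistics of one data case under theta."""
    logp = [math.log(w) - 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
            for w, m, v in params]
    mx = max(logp)
    norm = sum(math.exp(l - mx) for l in logp)
    resp = [math.exp(l - mx) / norm for l in logp]
    return [[r, r * x, r * x * x] for r in resp]

def stepwise_em(data, init_params, kappa=0.7, n_epochs=5, seed=0):
    rng = random.Random(seed)
    # Initialize mu so that theta(mu) reproduces the initial parameters
    mu = [[w, w * m, w * (m * m + v)] for w, m, v in init_params]
    k = 0
    for _ in range(n_epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            s = ess_for_case(data[i], theta_from_ess(mu))
            eta = (k + 2) ** (-kappa)
            # mu := (1 - eta_k) mu + eta_k s_i
            mu = [[(1 - eta) * a + eta * b for a, b in zip(mk, sk)]
                  for mk, sk in zip(mu, s)]
            k += 1
    return theta_from_ess(mu)

rng = random.Random(1)
data = [rng.gauss(-5.0, 1.0) for _ in range(300)] + [rng.gauss(5.0, 1.0) for _ in range(300)]
params = stepwise_em(data, [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)])
```

Only `mu` is stored between updates, so memory use does not grow with $N$; an incremental EM variant would additionally have to keep one `s` per data case.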


Other EM variants * 


EM is one of the most widely used algorithms in statistics and machine learning. Not surprisingly, many variations have been proposed. We briefly mention a few below, some of which we will use in later chapters. See (McLachlan and Krishnan 1997) for more information.


• Annealed EM In general, EM will only converge to a local maximum. To increase the chance of finding the global maximum, we can use a variety of methods. One approach is to use a method known as deterministic annealing (Rose 1998). The basic idea is to "smooth" the posterior "landscape" by raising it to a temperature, and then gradually cooling it, all the while slowly tracking the global maximum. See Figure 11.17 for a sketch. (A stochastic version
Figure 11.18 Illustration of possible behaviors of variational EM. (a) The lower bound increases at each iteration, and so does the likelihood. (b) The lower bound increases but the likelihood decreases. In this case, the algorithm is closing the gap between the approximate and true posterior. This can have a regularizing effect. Axes in both panels: training time (horizontal), true log-likelihood (vertical). Based on Figure 6 of (Saul et al. 1996). Figure generated by varEMbound.


of this algorithm is described in Section 24.6.1.) An annealed version of EM is described in 
(Ueda and Nakano 1998). 

• Variational EM In Section 11.4.7, we showed that the optimal thing to do in the E step is to make $q_i$ be the exact posterior over the latent variables, $q_i^t(z_i) = p(z_i|x_i, \theta^t)$. In this case, the lower bound on the log likelihood will be tight, so the M step will "push up" on the log-likelihood itself. However, sometimes it is computationally intractable to perform exact inference in the E step, but we may be able to perform approximate inference. If we can ensure that the E step is performing inference based on a lower bound to the likelihood, then the M step can be seen as monotonically increasing this lower bound (see Figure 11.18). This is called variational EM (Neal and Hinton 1998). See Chapter 21 for some variational inference methods that can be used in the E step.

• Monte Carlo EM Another approach to handling an intractable E step is to use a Monte Carlo approximation to the expected sufficient statistics. That is, we draw samples from the posterior, $z_i^s \sim p(z_i|x_i, \theta^t)$, and then compute the sufficient statistics for each completed vector, $(x_i, z_i^s)$, and then average the results. This is called Monte Carlo EM or MCEM (Wei and Tanner 1990). (If we only draw a single sample, it is called stochastic EM (Celeux and Diebolt 1985).) One way to draw samples is to use MCMC (see Chapter 24). However, if we have to wait for MCMC to converge inside each E step, the method becomes very slow. An alternative is to use stochastic approximation, and only perform "brief" sampling in the E step, followed by a partial parameter update. This is called stochastic approximation EM (Delyon et al. 1999) and tends to work better than MCEM. Another alternative is to apply MCMC to infer the parameters as well as the latent variables (a fully Bayesian approach), thus eliminating the distinction between E and M steps. See Chapter 24 for details.

• Generalized EM Sometimes we can perform the E step exactly, but we cannot perform the M step exactly. However, we can still monotonically increase the log likelihood by performing a "partial" M step, in which we merely increase the expected complete data log likelihood, rather than maximizing it. For example, we might follow a few gradient steps. This is called the generalized EM or GEM algorithm. (This is an unfortunate term, since there are many ways to generalize EM....)

Figure 11.19 Illustration of adaptive over-relaxed EM applied to a mixture of 5 Gaussians in 15 dimensions. We show the algorithm applied to two different datasets, randomly sampled from a mixture of 10 Gaussians. We plot the convergence for different update rates $\eta$. Using $\eta = 1$ gives the same results as regular EM. The actual running time is printed in the legend. Figure generated by mixGaussOverRelaxedEmDemo.

• ECM(E) algorithm The ECM algorithm stands for "expectation conditional maximization", and refers to optimizing the parameters in the M step sequentially, if they turn out to be dependent. The ECME algorithm, which stands for "ECM either" (Liu and Rubin 1995), is a variant of ECM in which we maximize the expected complete data log likelihood (the Q function) as usual, or the observed data log likelihood, during one or more of the conditional maximization steps. The latter can be much faster, since it ignores the results of the E step, and directly optimizes the objective of interest. A standard example of this is when fitting the Student T distribution. For fixed $\nu$, we can update $\Sigma$ as usual, but then to update $\nu$, we replace the standard update of the form $\nu^{t+1} = \arg\max_\nu Q((\mu^{t+1}, \Sigma^{t+1}, \nu), \theta^t)$ with $\nu^{t+1} = \arg\max_\nu \log p(D|\mu^{t+1}, \Sigma^{t+1}, \nu)$. See (McLachlan and Krishnan 1997) for more information.

• Over-relaxed EM Vanilla EM can be quite slow, especially if there is lots of missing data. The adaptive overrelaxed EM algorithm (Salakhutdinov and Roweis 2003) performs an update of the form $\theta^{t+1} = \theta^t + \eta(M(\theta^t) - \theta^t)$, where $\eta$ is a step-size parameter, and $M(\theta^t)$ is the usual update computed during the M step. Obviously this reduces to standard EM if $\eta = 1$, but using larger values of $\eta$ can result in faster convergence. See Figure 11.19 for an illustration. Unfortunately, using too large a value of $\eta$ can cause the algorithm to fail to converge.


Finally, note that EM is in fact just a special case of a larger class of algorithms known as 
bound optimization or MM algorithms (MM stands for minorize-maximize). See (Hunter and 
Lange 2004) for further discussion. 


Model selection for latent variable models


When using LVMs, we must specify the number of latent variables, which controls the model 
complexity. In particular, in the case of mixture models, we must specify K, the number 
of clusters. Choosing these parameters is an example of model selection. We discuss some 
approaches below. 


Model selection for probabilistic models 


The optimal Bayesian approach, discussed in Section 5.3, is to pick the model with the largest marginal likelihood, $K^* = \arg\max_K p(D|K)$.

There are two problems with this. First, evaluating the marginal likelihood for LVMs is quite difficult. In practice, simple approximations, such as BIC, can be used (see e.g., (Fraley and Raftery 2002)). Alternatively, we can use the cross-validated likelihood as a performance measure, although this can be slow, since it requires fitting each model $F$ times, where $F$ is the number of CV folds.

The second issue is the need to search over a potentially large number of models. The usual 
approach is to perform exhaustive search over all candidate values of K. However, sometimes 
we can set the model to its maximal size, and then rely on the power of the Bayesian Occam’s 
razor to “kill off” unwanted components. An example of this will be shown in Section 21.6.1.6, 
when we discuss variational Bayes. 

An alternative approach is to perform stochastic sampling in the space of models. Traditional 
approaches, such as (Green 1998, 2003; Lunn et al. 2009), are based on reversible jump MCMC, 
and use birth moves to propose new centers, and death moves to kill off old centers. However, 
this can be slow and difficult to implement. A simpler approach is to use a Dirichlet process 
mixture model, which can be fit using Gibbs sampling, but still allows for an unbounded number 
of mixture components; see Section 25.2 for details. 

Perhaps surprisingly, these sampling-based methods can be faster than the simple approach 
of evaluating the quality of each K separately. The reason is that fitting the model for each 
K is often slow. By contrast, the sampling methods can often quickly determine that a certain 
value of K is poor, and thus they need not waste time in that part of the posterior. 


Model selection for non-probabilistic methods 


What if we are not using a probabilistic model? For example, how do we choose K for the K-means algorithm? Since this does not correspond to a probability model, there is no likelihood, so none of the methods described above can be used.

An obvious proxy for the likelihood is the reconstruction error. Define the squared reconstruction error of a data set D, using model complexity K, as follows:

$$E(D, K) = \frac{1}{|D|}\sum_{i \in D} \|x_i - \hat{x}_i\|^2 \tag{11.95}$$

In the case of K-means, the reconstruction is given by $\hat{x}_i = \mu_{z_i}$, where $z_i = \arg\min_k \|x_i - \mu_k\|_2^2$, as explained in Section 11.4.2.6.

Figure 11.20(a) plots the reconstruction error on the test set for K-means. We notice that the error decreases with increasing model complexity! The reason for this behavior is as follows: when we add more and more centroids to K-means, we can "tile" the space more densely, as shown in Figure 11.21(b). Hence any given test vector is more likely to find a close prototype to accurately represent it as K increases, thus decreasing reconstruction error. However, if we use a probabilistic model, such as the GMM, and plot the negative log-likelihood, we get the usual U-shaped curve on the test set, as shown in Figure 11.20(b).

Figure 11.20 Test set performance vs K for data generated from a mixture of 3 Gaussians in 1d (data is shown in Figure 11.21(a)). (a) MSE on test set for K-means. (b) Negative log likelihood on test set for GMM. Figure generated by kmeansModelSelid.

Figure 11.21 Synthetic data generated from a mixture of 3 Gaussians in 1d. (a) Histogram of training data. (Test data looks essentially the same.) (b) Centroids estimated by K-means for $K \in \{2, 3, 4, 5, 6, 10\}$. (c) GMM density model estimated by EM for the same values of K. Figure generated by kmeansModelSelid.
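The behavior of the test set reconstruction error can be reproduced with a small experiment. The sketch below fits K-means (Lloyd's algorithm) in one dimension to data drawn from three Gaussians and evaluates Equation 11.95 on held-out data. The cluster centers at -4, 0 and 4, the sample sizes, the initialization at evenly spaced order statistics and the function names are assumptions chosen for illustration; this is not the book's kmeansModelSelid demo.

```python
import random

def kmeans_1d(x, K, n_iter=50):
    """Lloyd's algorithm in 1-D, initialized at evenly spaced order statistics."""
    xs = sorted(x)
    centers = [xs[(2 * k + 1) * len(xs) // (2 * K)] for k in range(K)]
    for _ in range(n_iter):
        groups = [[] for _ in range(K)]
        for xi in x:
            k = min(range(K), key=lambda j: (xi - centers[j]) ** 2)
            groups[k].append(xi)
        # Keep the old center if a cluster happens to be empty
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def reconstruction_error(x, centers):
    """Eq. 11.95: mean squared distance from each point to its nearest centroid."""
    return sum(min((xi - c) ** 2 for c in centers) for xi in x) / len(x)

def sample(n, rng):
    """Draw n points from an equal-weight mixture of three unit-variance Gaussians."""
    return [rng.gauss(rng.choice([-4.0, 0.0, 4.0]), 1.0) for _ in range(n)]

rng = random.Random(0)
train, test = sample(600, rng), sample(600, rng)
errors = {K: reconstruction_error(test, kmeans_1d(train, K)) for K in [2, 3, 4, 5, 6, 10]}
```

Inspecting `errors` shows the held-out error still falling well past $K = 3$, so it gives no signal about the number of clusters that generated the data.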

In supervised learning, we can always use cross validation to select between non-probabilistic 
models of different complexity, but this is not the case with unsupervised learning. Although 
this is not a novel observation (e.g., it is mentioned in passing in (Hastie et al. 2009, p519), one 
of the standard references in this field), it is perhaps not as widely appreciated as it should be. 
In fact, it is one of the more compelling arguments in favor of probabilistic models. 

Given that cross validation doesn’t work, and supposing one is unwilling to use probabilistic 
models (for some bizarre reason...), how can one choose K? The most common approach is to 
plot the reconstruction error on the training set versus K, and to try to identify a knee or kink 
in the curve. The idea is that for K < K*, where K* is the “true” number of clusters, the rate 
of decrease in the error function will be high, since we are splitting apart things that should 
not be grouped together. However, for K > K*, we are splitting apart “natural” clusters, which 
does not reduce the error by as much. 

This kink-finding process can be automated by use of the gap statistic (Tibshirani et al. 
2001). Nevertheless, identifying such kinks can be hard, as shown in Figure 11.20(a), since the 
loss function usually drops off gradually. A different approach to “kink finding” is described in 
Section 12.3.2.1. 


Fitting models with missing data 


Suppose we want to fit a joint density model by maximum likelihood, but we have "holes" in our data matrix, due to missing data (usually represented by NaNs). More formally, let $O_{ij} = 1$ if component $j$ of data case $i$ is observed, and let $O_{ij} = 0$ otherwise. Let $X_v = \{x_{ij} : O_{ij} = 1\}$ be the visible data, and $X_h = \{x_{ij} : O_{ij} = 0\}$ be the missing or hidden data. Our goal is to compute

$$\hat\theta = \arg\max_\theta p(X_v|\theta, O) \tag{11.96}$$

Under the missing at random assumption (see Section 8.6.2), we have

$$p(X_v|\theta, O) = \prod_{i=1}^N p(x_{iv}|\theta) \tag{11.97}$$

where $x_{iv}$ is a vector created from row $i$ and the columns indexed by the set $\{j : O_{ij} = 1\}$. Hence the log-likelihood has the form

$$\log p(X_v|\theta) = \sum_i \log p(x_{iv}|\theta) \tag{11.98}$$

where

$$p(x_{iv}|\theta) = \sum_{x_{ih}} p(x_{iv}, x_{ih}|\theta) \tag{11.99}$$

and $x_{ih}$ is the vector of hidden variables for case $i$ (assumed discrete for notational simplicity). Substituting in, we get

$$\log p(X_v|\theta) = \sum_i \log\left[\sum_{x_{ih}} p(x_{iv}, x_{ih}|\theta)\right] \tag{11.100}$$

Unfortunately, this objective is hard to maximize, since we cannot push the log inside the sum. However, we can use the EM algorithm to compute a local optimum. We give an example of this below.


EM for the MLE of an MVN with missing data 
Suppose we want to fit an MVN by maximum likelihood, but we have missing data. We can use 
EM to find a local maximum of the objective, as we explain below. 


Getting started 


To get the algorithm started, we can compute the MLE based on those rows of the data matrix that are fully observed. If there are no such rows, we can use some ad-hoc imputation procedures, and then compute an initial MLE.


E step 


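Once we have $\theta^{t-1}$, we can compute the expected complete data log likelihood at iteration $t$ as follows:

$$Q(\theta, \theta^{t-1}) = E\left[\sum_{i=1}^N \log\mathcal{N}(x_i|\mu, \Sigma)\,\Big|\, D, \theta^{t-1}\right] \tag{11.101}$$

$$= -\frac{N}{2}\log|2\pi\Sigma| - \frac{1}{2}\sum_i E\left[(x_i - \mu)^T\Sigma^{-1}(x_i - \mu)\right] \tag{11.102}$$

$$= -\frac{N}{2}\log|2\pi\Sigma| - \frac{1}{2}\mathrm{tr}\left(\Sigma^{-1}\sum_i E\left[(x_i - \mu)(x_i - \mu)^T\right]\right) \tag{11.103}$$

$$= -\frac{N}{2}\log|\Sigma| - \frac{ND}{2}\log(2\pi) - \frac{1}{2}\mathrm{tr}\left(\Sigma^{-1}E[S(\mu)]\right) \tag{11.104}$$

where

$$E[S(\mu)] \triangleq \sum_i \left(E[x_i x_i^T] + \mu\mu^T - 2\mu E[x_i]^T\right) \tag{11.105}$$

(We drop the conditioning of the expectation on $D$ and $\theta^{t-1}$ for brevity.) We see that we need to compute $\sum_i E[x_i]$ and $\sum_i E[x_i x_i^T]$; these are the expected sufficient statistics.

To compute these quantities, we use the results from Section 4.3.1. Specifically, consider case $i$, where components $v$ are observed and components $h$ are unobserved. We have

$$x_{ih}|x_{iv}, \theta \sim \mathcal{N}(m_i, V_i) \tag{11.106}$$

$$m_i \triangleq \mu_h + \Sigma_{hv}\Sigma_{vv}^{-1}(x_{iv} - \mu_v) \tag{11.107}$$

$$V_i \triangleq \Sigma_{hh} - \Sigma_{hv}\Sigma_{vv}^{-1}\Sigma_{vh} \tag{11.108}$$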

Hence the expected sufficient statistics are

$$E[x_i] = (E[x_{ih}]; x_{iv}) = (m_i; x_{iv}) \tag{11.109}$$

where we have assumed (without loss of generality) that the unobserved variables come before the observed variables in the node ordering.

To compute $E[x_i x_i^T]$, we use the result that $\mathrm{cov}[x] = E[xx^T] - E[x]E[x^T]$. Hence

$$E[x_i x_i^T] = E\left[\begin{pmatrix} x_{ih} \\ x_{iv} \end{pmatrix}\begin{pmatrix} x_{ih}^T & x_{iv}^T \end{pmatrix}\right] = \begin{pmatrix} E[x_{ih}x_{ih}^T] & E[x_{ih}]x_{iv}^T \\ x_{iv}E[x_{ih}]^T & x_{iv}x_{iv}^T \end{pmatrix} \tag{11.110}$$

$$E[x_{ih}x_{ih}^T] = E[x_{ih}]E[x_{ih}]^T + V_i \tag{11.111}$$

M step


By solving $\nabla Q(\theta, \theta^{(t-1)}) = 0$, we can show that the M step is equivalent to plugging these ESS into the usual MLE equations to get

$$\mu^t = \frac{1}{N}\sum_i E[x_i] \tag{11.112}$$

$$\Sigma^t = \frac{1}{N}\sum_i E[x_i x_i^T] - \mu^t(\mu^t)^T \tag{11.113}$$


Thus we see that EM is not equivalent to simply replacing variables by their expectations and 
applying the standard MLE formula; that would ignore the posterior variance and would result 
in an incorrect estimate. Instead we must compute the expectation of the sufficient statistics, 
and plug that into the usual equation for the MLE. We can easily modify the algorithm to 
perform MAP estimation, by plugging in the ESS into the equation for the MAP estimate. For an 
implementation, see gaussMissingFitEm. 
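As a small illustration of these updates (separate from gaussMissingFitEm), the sketch below handles a two-dimensional Gaussian in which only the second coordinate may be missing, so that $\Sigma_{vv}$ is a scalar and no matrix inversion is needed. The restriction to this missingness pattern, the function name, the encoding of missing values as `None` and the toy data are all assumptions.

```python
def em_mvn_missing_2d(data, n_iter=100):
    """EM for a 2-D Gaussian; data is a list of (x1, x2) where x2 may be None."""
    n = len(data)
    full = [(a, b) for a, b in data if b is not None]
    # Getting started: MLE from the fully observed rows
    m1 = sum(a for a, _ in full) / len(full)
    m2 = sum(b for _, b in full) / len(full)
    s11 = sum((a - m1) ** 2 for a, _ in full) / len(full)
    s22 = sum((b - m2) ** 2 for _, b in full) / len(full)
    s12 = sum((a - m1) * (b - m2) for a, b in full) / len(full)
    for _ in range(n_iter):
        e1 = e2 = e11 = e12 = e22 = 0.0
        for a, b in data:
            if b is None:
                mi = m2 + s12 / s11 * (a - m1)   # Eq. 11.107
                vi = s22 - s12 * s12 / s11       # Eq. 11.108
                eb, ebb = mi, mi * mi + vi       # Eq. 11.111: note the added variance
            else:
                eb, ebb = b, b * b
            e1 += a
            e2 += eb
            e11 += a * a
            e12 += a * eb
            e22 += ebb
        m1, m2 = e1 / n, e2 / n                  # Eq. 11.112
        s11 = e11 / n - m1 * m1                  # Eq. 11.113
        s12 = e12 / n - m1 * m2
        s22 = e22 / n - m2 * m2
    return (m1, m2), ((s11, s12), (s12, s22))

# Toy data: two of the six rows are missing their second coordinate.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, None), (5.0, 9.8), (6.0, None)]
mean, cov = em_mvn_missing_2d(data)

# With no missing entries, the procedure returns the ordinary MLE.
full_rows = [row for row in data if row[1] is not None]
mean_full, cov_full = em_mvn_missing_2d(full_rows)
```

Dropping the `vi` term in the line marked Eq. 11.111 would give the "replace by expectations" shortcut that the text warns against; it underestimates the variance of the partially observed coordinate.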


Example 


As an example of this procedure in action, let us reconsider the imputation problem from Section 4.3.2.3, which had $N = 100$ 10-dimensional data cases, with 50% missing data. Let us fit the parameters using EM. Call the resulting parameters $\hat\theta$. We can use our model for predictions by computing $E[x_{ih}|x_{iv}, \hat\theta]$. Figure 11.22(a-b) indicates that the results obtained using the learned parameters are almost as good as with the true parameters. Not surprisingly, performance improves with more data, or as the fraction of missing data is reduced.


Extension to the GMM case 

It is straightforward to fit a mixture of Gaussians in the presence of partially observed data 
vectors x;. We leave the details as an exercise. 

Exercises 


Exercise 11.1 Student T as infinite mixture of Gaussians 
Derive Equation 11.61. For simplicity, assume a one-dimensional distribution. 
Figure 11.22 Illustration of data imputation. (a) Scatter plot of true values vs imputed values using true parameters. (b) Same as (a), but using parameters estimated with EM. Figure generated by gaussImputationDemo.


Exercise 11.2 EM for mixtures of Gaussians
Show that the M step for ML estimation of a mixture of Gaussians is given by

$$\mu_k = \frac{\sum_i r_{ik} x_i}{r_k} \tag{11.114}$$

$$\Sigma_k = \frac{\sum_i r_{ik}(x_i - \mu_k)(x_i - \mu_k)^T}{r_k} = \frac{\sum_i r_{ik} x_i x_i^T - r_k\mu_k\mu_k^T}{r_k} \tag{11.115}$$


Exercise 11.3 EM for mixtures of Bernoullis
• Show that the M step for ML estimation of a mixture of Bernoullis is given by

$$\mu_{kj} = \frac{\sum_i r_{ik} x_{ij}}{\sum_i r_{ik}} \tag{11.116}$$

• Show that the M step for MAP estimation of a mixture of Bernoullis with a $\mathrm{Beta}(\alpha, \beta)$ prior is given by

$$\mu_{kj} = \frac{\left(\sum_i r_{ik} x_{ij}\right) + \alpha - 1}{\left(\sum_i r_{ik}\right) + \alpha + \beta - 2} \tag{11.117}$$


Exercise 11.4 EM for mixture of Student distributions 


Derive the EM algorithm for ML estimation of a mixture of multivariate Student T distributions. 


Exercise 11.5 Gradient descent for fitting GMM


Consider the Gaussian mixture model

$$p(x|\theta) = \sum_k \pi_k \mathcal{N}(x|\mu_k, \Sigma_k) \tag{11.118}$$

Define the log likelihood as

$$\ell(\theta) = \sum_{n=1}^N \log p(x_n|\theta) \tag{11.119}$$

Define the posterior responsibility that cluster $k$ has for datapoint $n$ as follows:

$$r_{nk} \triangleq p(z_n = k|x_n, \theta) = \frac{\pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{k'} \pi_{k'} \mathcal{N}(x_n|\mu_{k'}, \Sigma_{k'})} \tag{11.120}$$

a. Show that the gradient of the log-likelihood wrt $\mu_k$ is

$$\frac{d}{d\mu_k}\ell(\theta) = \sum_n r_{nk}\Sigma_k^{-1}(x_n - \mu_k) \tag{11.121}$$

b. Derive the gradient of the log-likelihood wrt $\pi_k$. (For now, ignore any constraints on $\pi_k$.)

c. One way to handle the constraint that $\sum_k \pi_k = 1$ is to reparameterize using the softmax function:

$$\pi_k \triangleq \frac{e^{w_k}}{\sum_{k'} e^{w_{k'}}} \tag{11.122}$$

Here $w_k \in \mathbb{R}$ are unconstrained parameters. Show that

$$\frac{d}{dw_k}\ell(\theta) = \sum_n r_{nk} - \pi_k \tag{11.123}$$

(There may be a constant factor missing in the above expression...) Hint: use the chain rule and the fact that

$$\frac{d\pi_j}{dw_k} = \begin{cases} \pi_j(1 - \pi_j) & \text{if } j = k \\ -\pi_j\pi_k & \text{if } j \neq k \end{cases} \tag{11.124}$$

which follows from Exercise 8.4(1).

d. Derive the gradient of the log-likelihood wrt $\Sigma_k$. (For now, ignore any constraints on $\Sigma_k$.)

e. One way to handle the constraint that $\Sigma_k$ be a symmetric positive definite matrix is to reparameterize using a Cholesky decomposition, $\Sigma_k = R_k^T R_k$, where $R_k$ is an upper-triangular, but otherwise unconstrained matrix. Derive the gradient of the log-likelihood wrt $R_k$.

Figure 11.23 A mixture of Gaussians with two discrete latent indicators. $J_n$ specifies which mean to use, and $K_n$ specifies which variance to use.


Exercise 11.6 EM for a finite scale mixture of Gaussians 
(Source: Jaakkola.) Consider the graphical model in Figure 11.23 which defines the following:

$$p(x_n|\theta) = \sum_{j=1}^m p_j \left[ \sum_{k=1}^l q_k \mathcal{N}(x_n|\mu_j, \sigma_k^2) \right] \tag{11.125}$$

where $\theta = \{p_1, \ldots, p_m, \mu_1, \ldots, \mu_m, q_1, \ldots, q_l, \sigma_1^2, \ldots, \sigma_l^2\}$ are all the parameters. Here $p_j \triangleq P(J_n = j)$
and $q_k \triangleq P(K_n = k)$ are the equivalent of mixture weights. We can think of this as a mixture
of $m$ non-Gaussian components, where each component distribution is a scale mixture,
$p(x|j; \theta) = \sum_{k=1}^l q_k \mathcal{N}(x; \mu_j, \sigma_k^2)$, combining Gaussians with different variances (scales).


We will now derive a generalized EM algorithm for this model. (Recall that in generalized EM, we do a 
partial update in the M step, rather than finding the exact maximum.) 


a. Derive an expression for the responsibilities, $P(J_n = j, K_n = k|x_n, \theta)$, needed for the E step.


b. Write out a full expression for the expected complete log-likelihood 


$$Q(\theta^{new}, \theta^{old}) = E_{\theta^{old}} \sum_{n=1}^N \log P(J_n, K_n, x_n|\theta^{new}) \tag{11.126}$$

c. Solving the M-step would require us to jointly optimize the means $\mu_1, \ldots, \mu_m$ and the variances
$\sigma_1^2, \ldots, \sigma_l^2$. It will turn out to be simpler to first solve for the $\mu_j$'s given fixed $\sigma_k^2$'s, and subsequently
solve for the $\sigma_k^2$'s given the new values of the $\mu_j$'s. For brevity, we will just do the first part. Derive an
expression for the maximizing $\mu_j$'s given fixed $\sigma_{1:l}^2$, i.e., solve $\frac{\partial Q}{\partial \mu^{new}} = 0$.


Exercise 11.7 Manual calculation of the M step for a GMM 


(Source: de Freitas.) In this question we consider clustering 1D data with a mixture of 2 Gaussians using 
the EM algorithm. You are given the 1-D data points x = [1 10 20]. Suppose the output of the E 
step is the following matrix: 


$$R = \begin{pmatrix} 1 & 0 \\ 0.4 & 0.6 \\ 0 & 1 \end{pmatrix} \tag{11.127}$$

where entry $r_{i,c}$ is the probability of observation $x_i$ belonging to cluster $c$ (the responsibility of cluster $c$ for
data point $i$). You just have to compute the M step. You may state the equations for maximum likelihood
estimates of these quantities (which you should know) without proof; you just have to apply the equations 
to this data set. You may leave your answer in fractional form. Show your work. 


a. Write down the likelihood function you are trying to optimize. 
b. After performing the M step for the mixing weights $\pi_1, \pi_2$, what are the new values?


c. After performing the M step for the means $\mu_1$ and $\mu_2$, what are the new values?
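The M step for this exercise can be checked numerically. The sketch below assumes the standard maximum likelihood updates for a GMM (weights are the normalized column sums of the responsibilities, means are responsibility-weighted averages); the variable names `Nk`, `pi` and `mu` are illustrative choices rather than notation from the text.

```python
import numpy as np

# Data and responsibilities exactly as given in the exercise.
x = np.array([1.0, 10.0, 20.0])
R = np.array([[1.0, 0.0],
              [0.4, 0.6],
              [0.0, 1.0]])

# Effective number of points assigned to each cluster: N_c = sum_i r_ic.
Nk = R.sum(axis=0)

# M step for the mixing weights: pi_c = N_c / N.
pi = Nk / len(x)

# M step for the means: mu_c = (sum_i r_ic x_i) / N_c.
mu = (R.T @ x) / Nk

print("mixing weights:", pi)
print("means:", mu)
```

The fractional answers follow directly from the same sums, for example the first weight is 1.4/3 and the second mean is 26/1.6.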


Exercise 11.8 Moments of a mixture of Gaussians 


Consider a mixture of K Gaussians 


$$p(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x|\mu_k, \Sigma_k) \tag{11.128}$$

a. Show that

$$E[x] = \sum_k \pi_k \mu_k \tag{11.129}$$

Figure 11.24 Some data points in 2d. Circles represent the initial guesses for $m_1$ and $m_2$.


b. Show that

$$\text{cov}[x] = \sum_k \pi_k \left[\Sigma_k + \mu_k \mu_k^T\right] - E[x] E[x]^T \tag{11.130}$$

Hint: use the fact that $\text{cov}[x] = E[xx^T] - E[x] E[x]^T$.


Exercise 11.9 K-means clustering by hand


(Source: Jaakkola.)


In Figure 11.24, we show some data points which lie on the integer grid. (Note that the x-axis has been 
compressed; distances should be measured using the actual grid coordinates.) Suppose we apply the K- 
means algorithm to this data, using K = 2 and with the centers initialized at the two circled data points. 
Draw the final clusters obtained after K-means converges (show the approximate location of the new centers 
and group together all the points assigned to each center). Hint: think about shortest Euclidean distance. 


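Exercise 11.10 Deriving the K-means cost function
Show that

$$J_W(z) = \frac{1}{2} \sum_{k=1}^K \sum_{i: z_i = k} \sum_{i': z_{i'} = k} (x_i - x_{i'})^2 = \sum_{k=1}^K n_k \sum_{i: z_i = k} (x_i - \bar{x}_k)^2 \tag{11.131}$$

Hint: note that, for any $\mu$,

$$\sum_i (x_i - \mu)^2 = \sum_i \left[(x_i - \bar{x}) - (\mu - \bar{x})\right]^2 \tag{11.132}$$

$$= \sum_i (x_i - \bar{x})^2 + \sum_i (\bar{x} - \mu)^2 - 2 \sum_i (x_i - \bar{x})(\mu - \bar{x}) \tag{11.133}$$

$$= n s^2 + n (\bar{x} - \mu)^2 \tag{11.134}$$

where $s^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2$, since

$$\sum_i (x_i - \bar{x})(\mu - \bar{x}) = (\mu - \bar{x}) \left( \left(\sum_i x_i\right) - n \bar{x} \right) = (\mu - \bar{x})(n \bar{x} - n \bar{x}) = 0 \tag{11.135}$$


Exercise 11.11 Visible mixtures of Gaussians are in the exponential family


Show that the joint distribution $p(x, z|\theta)$ for a 1d GMM can be represented in exponential family form.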
Figure 11.25 Example of censored linear regression. Black circles are observed training points, red crosses
are observed but censored training points. Green stars are predicted values of the censored training points. 
We also show the lines fit by least squares (ignoring censoring) and by EM. Based on Figure 5.6 of (Tanner 
1996). Figure generated by linregCensoredSchmeeHahnDemo, written by Hannes Bretschneider. 


Exercise 11.12 EM for robust linear regression with a Student t likelihood 


Consider a model of the form 
$$p(y_i|x_i, w, \sigma^2, \nu) = \mathcal{T}(y_i|w^T x_i, \sigma^2, \nu) \tag{11.136}$$


Derive an EM algorithm to compute the MLE for $w$. You may assume $\nu$ and $\sigma^2$ are fixed, for simplicity.
Hint: see Section 11.4.5. 


Exercise 11.13 EM for EB estimation of Gaussian shrinkage model 


Extend the results of Section 5.6.2.2 to the case where the $\sigma_j^2$ are not equal (but are known). Hint: treat
the $\theta_j$ as hidden variables, integrate them out in the E step, and maximize $\eta = (\mu, \tau^2)$ in the
M step.


Exercise 11.14 EM for censored linear regression 


Censored regression refers to the case where one knows the outcome is at least (or at most) a certain 
value, but the precise value is unknown. This arises in many different settings. For example, suppose one 
is trying to learn a model that can predict how long a program will take to run, for different settings of 
its parameters. One may abort certain runs if they seem to be taking too long; the resulting run times are 
said to be right censored. For such runs, all we know is that $y_i \geq c_i$, where $c_i$ is the censoring time,
that is, $y_i = \min(z_i, c_i)$, where $z_i$ is the true running time and $y_i$ is the observed running time. We
can also define left censored and interval censored models.³ Derive an EM algorithm for fitting a linear
regression model to right-censored data. Hint: use the results from Exercise 11.15. See Figure 11.25 for an 
example, based on the data from (Schmee and Hahn 1979). We notice that the EM line is tilted upwards 
more, since the model takes into account the fact that the truncated values are actually higher than the 
observed values. 


3. There is a closely related model in econometrics called the Tobit model, in which $y_i = \max(z_i, 0)$, so we only
get to observe positive outcomes. An example of this is when $z_i$ represents “desired investment”, and $y_i$ is actual
investment. Probit regression (Section 9.4) is another example.

Exercise 11.15 Posterior mean and variance of a truncated Gaussian


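Let $z_i = \mu_i + \sigma \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, 1)$. Sometimes, such as in probit regression or censored regression,
we do not observe $z_i$, but we observe the fact that it is above some threshold, namely we observe the event
$E = \mathbb{I}(z_i \geq c_i) = \mathbb{I}\left(\epsilon_i \geq \frac{c_i - \mu_i}{\sigma}\right)$. (See Exercise 11.14 for details on censored regression, and Section 11.4.6
for probit regression.) Show that

$$E[z_i|z_i \geq c_i] = \mu_i + \sigma H\left(\frac{c_i - \mu_i}{\sigma}\right) \tag{11.137}$$

and

$$E[z_i^2|z_i \geq c_i] = \mu_i^2 + \sigma^2 + \sigma (c_i + \mu_i) H\left(\frac{c_i - \mu_i}{\sigma}\right) \tag{11.138}$$

where we have defined

$$H(u) \triangleq \frac{\phi(u)}{1 - \Phi(u)} \tag{11.139}$$

and where $\phi(u)$ is the pdf of a standard Gaussian, and $\Phi(u)$ is its cdf.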
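Hint 1: we have $p(\epsilon_i|E) = \frac{p(\epsilon_i, E)}{p(E)}$, where $E$ is some event of interest.
Hint 2: It can be shown that

$$\frac{d}{dw} \mathcal{N}(w|0, 1) = -w \mathcal{N}(w|0, 1) \tag{11.140}$$

and hence

$$\int_b^c w \mathcal{N}(w|0, 1) \, dw = \mathcal{N}(b|0, 1) - \mathcal{N}(c|0, 1) \tag{11.141}$$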
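The two moment formulas in this exercise can be sanity-checked by simulation. The following is a minimal sketch, not a derivation: it evaluates Equations 11.137 and 11.138 using the standard library's `math.erf` for the Gaussian cdf, and compares them with Monte Carlo averages over samples that satisfy $z_i \geq c_i$. The function names `H` and `truncated_moments`, and the particular values of `mu`, `sigma` and `c`, are assumptions made for illustration.

```python
import math
import numpy as np

def H(u):
    """Ratio phi(u) / (1 - Phi(u)) for a standard Gaussian (Equation 11.139)."""
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    return phi / (1.0 - Phi)

def truncated_moments(mu, sigma, c):
    """First and second moments of z ~ N(mu, sigma^2) conditioned on z >= c."""
    a = (c - mu) / sigma
    m1 = mu + sigma * H(a)                                   # Equation 11.137
    m2 = mu ** 2 + sigma ** 2 + sigma * (c + mu) * H(a)      # Equation 11.138
    return m1, m2

# Monte Carlo check with illustrative (hypothetical) parameter values.
rng = np.random.default_rng(0)
mu, sigma, c = 1.0, 2.0, 2.5
z = mu + sigma * rng.standard_normal(2_000_000)
z = z[z >= c]                                                # keep only the censored region

m1, m2 = truncated_moments(mu, sigma, c)
print("E[z | z >= c]:   formula", m1, " simulation", z.mean())
print("E[z^2 | z >= c]: formula", m2, " simulation", (z ** 2).mean())
```

In an EM algorithm for censored regression, these two quantities would play the role of the expected sufficient statistics for the censored observations.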


12 Latent linear models


12.1 Factor analysis


One problem with mixture models is that they only use a single latent variable to generate the
observations. In particular, each observation can only come from one of K prototypes. One can
think of a mixture model as using K hidden binary variables, representing a one-hot encoding 
of the cluster identity. But because these variables are mutually exclusive, the model is still 
limited in its representational power. 

An alternative is to use a vector of real-valued latent variables, $z_i \in \mathbb{R}^L$. The simplest prior
to use is a Gaussian (we will consider other choices later):

$$p(z_i) = \mathcal{N}(z_i|\mu_0, \Sigma_0) \tag{12.1}$$

If the observations are also continuous, so $x_i \in \mathbb{R}^D$, we may use a Gaussian for the likelihood.
Just as in linear regression, we will assume the mean is a linear function of the (hidden) inputs,
thus yielding

$$p(x_i|z_i, \theta) = \mathcal{N}(x_i|W z_i + \mu, \Psi) \tag{12.2}$$

where $W$ is a $D \times L$ matrix, known as the factor loading matrix, and $\Psi$ is a $D \times D$ covariance
matrix. We take $\Psi$ to be diagonal, since the whole point of the model is to “force” $z_i$ to explain
the correlation, rather than “baking it in” to the observation’s covariance. This overall model
is called factor analysis or FA. The special case in which $\Psi = \sigma^2 I$ is called probabilistic
principal components analysis or PPCA. The reason for this name will become apparent later.

The generative process, where $L = 1$, $D = 2$ and $\Psi$ is diagonal, is illustrated in Figure 12.1.
We take an isotropic Gaussian “spray can” and slide it along the 1d line defined by $w z_i + \mu$.
This induces an elongated (and hence correlated) Gaussian in 2d.


12.1.1 FA is a low rank parameterization of an MVN


FA can be thought of as a way of specifying a joint density model on $x$ using a small number
of parameters. To see this, note that from Equation 4.126, the induced marginal distribution
$p(x_i|\theta)$ is a Gaussian:

$$p(x_i|\theta) = \int \mathcal{N}(x_i|W z_i + \mu, \Psi) \, \mathcal{N}(z_i|\mu_0, \Sigma_0) \, dz_i \tag{12.3}$$

$$= \mathcal{N}(x_i|W \mu_0 + \mu, \Psi + W \Sigma_0 W^T) \tag{12.4}$$


Figure 12.1 Illustration of the PPCA generative process, where we have L = 1 latent dimension generating
D = 2 observed dimensions. Based on Figure 12.9 of (Bishop 2006b). 


From this, we see that we can set $\mu_0 = 0$ without loss of generality, since we can always absorb
$W \mu_0$ into $\mu$. Similarly, we can set $\Sigma_0 = I$ without loss of generality, because we can always
“emulate” a correlated prior by defining a new weight matrix, $\tilde{W} = W \Sigma_0^{-1/2}$. Then we
find

$$\text{cov}[x|\theta] = \tilde{W} \Sigma_0 \tilde{W}^T + E[\epsilon \epsilon^T] = (W \Sigma_0^{-1/2}) \Sigma_0 (W \Sigma_0^{-1/2})^T + \Psi = W W^T + \Psi \tag{12.5}$$


We thus see that FA approximates the covariance matrix of the visible vector using a low-rank
decomposition:

$$C \triangleq \text{cov}[x] = W W^T + \Psi \tag{12.6}$$

This only uses $O(LD)$ parameters, which allows a flexible compromise between a full covariance
Gaussian, with $O(D^2)$ parameters, and a diagonal covariance, with $O(D)$ parameters. Note that
if we did not restrict $\Psi$ to be diagonal, we could trivially set $\Psi$ to a full covariance matrix; then
we could set $W = 0$, in which case the latent factors would not be required.


12.1.2 Inference of the latent factors


Although FA can be thought of as just a way to define a density on x, it is often used because 
we hope that the latent factors z will reveal something interesting about the data. To do this, 
we need to compute the posterior over the latent factors. We can use Bayes rule for Gaussians 
to give 


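$$p(z_i|x_i, \theta) = \mathcal{N}(z_i|m_i, \Sigma_i) \tag{12.7}$$

$$\Sigma_i \triangleq (\Sigma_0^{-1} + W^T \Psi^{-1} W)^{-1} \tag{12.8}$$

$$m_i \triangleq \Sigma_i (W^T \Psi^{-1} (x_i - \mu) + \Sigma_0^{-1} \mu_0) \tag{12.9}$$

Note that in the FA model, $\Sigma_i$ is actually independent of $i$, so we can denote it by $\Sigma$. Computing
this matrix takes $O(L^3 + L^2 D)$ time, and computing each $m_i = E[z_i|x_i, \theta]$ takes $O(L^2 + LD)$
time. The $m_i$ are sometimes called the latent scores, or latent factors.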
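To make Equations 12.8 and 12.9 concrete, here is a minimal NumPy sketch of posterior inference for the latent factors. It assumes that $\Psi$ is stored as a vector `psi` of its diagonal entries and that the data matrix `X` has one observation per row; the function name `fa_posterior` and all argument names are illustrative, not part of any library.

```python
import numpy as np

def fa_posterior(X, W, mu, psi, mu0=None, Sigma0=None):
    """Posterior means (rows of M) and shared covariance Sigma of the latent factors.

    X: (N, D) data, W: (D, L) loadings, mu: (D,) offset, psi: (D,) diagonal of Psi.
    mu0 and Sigma0 default to the standard choice of a N(0, I) prior.
    """
    D, L = W.shape
    mu0 = np.zeros(L) if mu0 is None else mu0
    Sigma0_inv = np.eye(L) if Sigma0 is None else np.linalg.inv(Sigma0)

    Wt_Psi_inv = W.T / psi                                   # W^T Psi^{-1}, shape (L, D)
    Sigma = np.linalg.inv(Sigma0_inv + Wt_Psi_inv @ W)       # Equation 12.8
    # Row i of M is m_i^T = (W^T Psi^{-1} (x_i - mu) + Sigma0^{-1} mu0)^T Sigma.
    M = ((X - mu) @ Wt_Psi_inv.T + Sigma0_inv @ mu0) @ Sigma  # Equation 12.9
    return M, Sigma

# Small illustrative example with randomly chosen (hypothetical) parameters.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))
mu = rng.normal(size=4)
psi = rng.uniform(0.5, 2.0, size=4)
X = rng.normal(size=(6, 4))
M, Sigma = fa_posterior(X, W, mu, psi)
print(M.shape, Sigma.shape)
```

Because `psi` is a vector, the only matrix inverted has size $L \times L$, which is where the $O(L^3 + L^2 D)$ cost quoted above comes from.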
Figure 12.2 2D projection of 2004 cars data based on factor analysis. The blue text are the names of cars
corresponding to certain chosen points. Figure generated by faBiplotDemo. 


Let us give a simple example, based on (Shalizi 2009). We consider a dataset of D = 11 variables
and N = 387 cases describing various aspects of cars, such as the engine size, the number of
cylinders, the miles per gallon (MPG), the price, etc. We first fit an L = 2 dimensional model. We
can plot the $m_i$ scores as points in $\mathbb{R}^2$, to visualize the data, as shown in Figure 12.2.

To get a better understanding of the “meaning” of the latent factors, we can project unit vectors
corresponding to each of the feature dimensions, $e_1 = (1, 0, \ldots, 0)$, $e_2 = (0, 1, 0, \ldots, 0)$, etc.
into the low dimensional space. These are shown as blue lines in Figure 12.2; this is known as 
a biplot. We see that the horizontal axis represents price, corresponding to the features labeled 
“dealer” and “retail”, with expensive cars on the right. The vertical axis represents fuel efficiency 
(measured in terms of MPG) versus size: heavy vehicles are less efficient and are higher up, 
whereas light vehicles are more efficient and are lower down. We can “verify” this interpretation 
by clicking on some points, and finding the closest exemplars in the training set, and printing 
their names, as in Figure 12.2. However, in general, interpreting latent variable models is fraught 
with difficulties, as we discuss in Section 12.1.3. 


12.1.3 Unidentifiability


Just like with mixture models, FA is also unidentifiable. To see this, suppose $R$ is an arbitrary
orthogonal rotation matrix, satisfying $R R^T = I$. Let us define $\tilde{W} = W R$; then the likelihood
function of this modified matrix is the same as for the unmodified matrix, since

$$\text{cov}[x] = \tilde{W} E[z z^T] \tilde{W}^T + E[\epsilon \epsilon^T] \tag{12.10}$$

$$= W R R^T W^T + \Psi = W W^T + \Psi \tag{12.11}$$

Geometrically, multiplying $W$ by an orthogonal matrix is like rotating $z$ before generating $x$;
but since $z$ is drawn from an isotropic Gaussian, this makes no difference to the likelihood.
Consequently, we cannot uniquely identify $W$, and therefore cannot uniquely identify the latent
factors, either.
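The rotational ambiguity is easy to confirm numerically. The sketch below draws a random loading matrix and a random orthogonal matrix (obtained here from a QR decomposition, which is just one convenient way to generate one) and checks that the implied covariance $W W^T + \Psi$ is unchanged. All sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 5, 2

W = rng.normal(size=(D, L))                 # factor loading matrix
psi = rng.uniform(0.5, 1.5, size=D)         # diagonal of Psi

# Random L x L orthogonal matrix R (so that R R^T = I).
R, _ = np.linalg.qr(rng.normal(size=(L, L)))
W_rot = W @ R

C = W @ W.T + np.diag(psi)                  # covariance implied by W
C_rot = W_rot @ W_rot.T + np.diag(psi)      # covariance implied by W R

print("R is orthogonal:", np.allclose(R @ R.T, np.eye(L)))
print("same covariance:", np.allclose(C, C_rot))
print("same loadings:  ", np.allclose(W, W_rot))
```

Since the Gaussian likelihood depends on the loadings only through this covariance, the two loading matrices cannot be distinguished from data.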

To ensure a unique solution, we need to remove $L(L - 1)/2$ degrees of freedom, since that
is the number of free parameters in an orthonormal matrix of size $L \times L$.¹ In total, the FA model has
$D + LD - L(L - 1)/2$ free parameters (excluding the mean), where the first term arises from $\Psi$. Obviously
we require this to be less than or equal to $D(D + 1)/2$, which is the number of parameters in
an unconstrained (but symmetric) covariance matrix. This gives us an upper bound on $L$, as
follows:

$$L_{\max} = \lfloor D + 0.5 (1 - \sqrt{1 + 8D}) \rfloor \tag{12.12}$$

For example, $D = 6$ implies $L \leq 3$. But we usually never choose this upper bound, since it
would result in overfitting (see discussion in Section 12.3 on how to choose L).
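The bound in Equation 12.12 can be checked against the parameter-counting argument directly. This is a small sketch using only the standard library; the helper names `fa_num_params` and `l_max` are hypothetical.

```python
import math

def fa_num_params(D, L):
    """Free parameters of FA (excluding the mean): D + L*D - L*(L-1)/2."""
    return D + L * D - L * (L - 1) // 2

def l_max(D):
    """Upper bound on the latent dimension from Equation 12.12."""
    return math.floor(D + 0.5 * (1 - math.sqrt(1 + 8 * D)))

for D in (6, 10, 50):
    L = l_max(D)
    full = D * (D + 1) // 2                 # parameters of a full symmetric covariance
    print(f"D={D}: L_max={L}, FA params at L_max={fa_num_params(D, L)}, "
          f"at L_max+1={fa_num_params(D, L + 1)}, full covariance={full}")
```

For each `D`, the count at `L_max` does not exceed that of a full covariance matrix, while the count at `L_max + 1` does.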

Unfortunately, even if we set $L < L_{\max}$, we still cannot uniquely identify the parameters,
since the rotational ambiguity still exists. Non-identifiability does not affect the predictive per- 
formance of the model. However, it does affect the loading matrix, and hence the interpretation 
of the latent factors. Since factor analysis is often used to uncover structure in the data, this 
problem needs to be addressed. Here are some commonly used solutions: 


• Forcing W to be orthonormal. Perhaps the cleanest solution to the identifiability problem
is to force $W$ to be orthonormal, and to order the columns by decreasing variance of the
corresponding latent factors. This is the approach adopted by PCA, which we will discuss in
Section 12.2. The result is not necessarily more interpretable, but at least it is unique.

• Forcing W to be lower triangular. One way to achieve identifiability, which is popular
in the Bayesian community (e.g., (Lopes and West 2004)), is to ensure that the first visible
feature is only generated by the first latent factor, the second visible feature is only generated
by the first two latent factors, and so on. For example, if L = 3 and D = 4, the corresponding
factor loading matrix is given by

$$W = \begin{pmatrix} w_{11} & 0 & 0 \\ w_{21} & w_{22} & 0 \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{pmatrix} \tag{12.13}$$

We also require that $w_{jj} > 0$ for $j = 1 : L$. The total number of parameters in this


constrained matrix is D + DL — L(L — 1)/2, which is equal to the number of uniquely 
identifiable parameters. The disadvantage of this method is that the first L visible variables, 


1. To see this, note that there are L — 1 free parameters in R in the first column (since the column vector must be 
normalized to unit length), there are L — 2 free parameters in the second column (which must be orthogonal to the 
first), and so on. 
Figure 12.3 Mixture of factor analysers as a DGM.


known as the founder variables, affect the interpretation of the latent factors, and so must 
be chosen carefully. 

• Sparsity promoting priors on the weights. Instead of pre-specifying which entries in $W$
are zero, we can encourage the entries to be zero, using $\ell_1$ regularization (Zou et al. 2006),
ARD (Bishop 1999; Archambeau and Bach 2008), or spike-and-slab priors (Rattray et al. 2009).
This is called sparse factor analysis. This does not necessarily ensure a unique MAP estimate,
but it does encourage interpretable solutions. See Section 13.8.

• Choosing an informative rotation matrix. There are a variety of heuristic methods that try
to find rotation matrices $R$, which can be used to modify $W$ (and hence the latent factors) so
as to try to increase the interpretability, typically by encouraging them to be (approximately)
sparse. One popular method is known as varimax (Kaiser 1958).

• Use of non-Gaussian priors for the latent factors. In Section 12.6, we will discuss how
replacing $p(z_i)$ with a non-Gaussian distribution can enable us to sometimes uniquely identify
$W$ as well as the latent factors. This technique is known as ICA.


12.1.4 Mixtures of factor analysers


The FA model assumes that the data lives on a low dimensional linear manifold. In reality, most 
data is better modeled by some form of low dimensional curved manifold. We can approximate 
a curved manifold by a piecewise linear manifold. This suggests the following model: let the
k'th linear subspace of dimensionality $L_k$ be represented by $W_k$, for $k = 1 : K$. Suppose we
have a latent indicator $q_i \in \{1, \ldots, K\}$ specifying which subspace we should use to generate
the data. We then sample $z_i$ from a Gaussian prior and pass it through the $W_k$ matrix (where
$k = q_i$), and add noise. More precisely, the model is as follows:

$$p(x_i|z_i, q_i = k, \theta) = \mathcal{N}(x_i|\mu_k + W_k z_i, \Psi) \tag{12.14}$$

$$p(z_i|\theta) = \mathcal{N}(z_i|0, I) \tag{12.15}$$

$$p(q_i|\theta) = \text{Cat}(q_i|\pi) \tag{12.16}$$

Figure 12.4 Mixture of 1d PPCAs fit to a dataset, for K = 1, 10. Figure generated by
mixPpcaDemoNetlab.


This is called a mixture of factor analysers (MFA) (Hinton et al. 1997). The CI assumptions are 
represented in Figure 12.3. 

Another way to think about this model is as a low-rank version of a mixture of Gaussians. In 
particular, this model needs $O(KLD)$ parameters instead of the $O(KD^2)$ parameters needed
for a mixture of full covariance Gaussians. This can reduce overfitting. In fact, MFA is a good 
generic density model for high-dimensional real-valued data. 


12.1.5 EM for factor analysis models


Using the results from Chapter 4, it is straightforward to derive an EM algorithm to fit an FA 
model. With just a little more work, we can fit a mixture of FAs. Below we state the results 
without proof. The derivation can be found in (Ghahramani and Hinton 1996a); however, deriving 
these equations yourself is a useful exercise if you want to become proficient at the math. 

To obtain the results for a single factor analyser, just set $r_{ic} = 1$ and $c = 1$ in the equations
below. In Section 12.2.5 we will see a further simplification of these equations that arises when
fitting a PPCA model, where the results will turn out to have a particularly simple and elegant
interpretation.

In the E step, we compute the posterior responsibility of cluster $c$ for data point $i$ using

$$r_{ic} \triangleq p(q_i = c|x_i, \theta) \propto \pi_c \mathcal{N}(x_i|\mu_c, W_c W_c^T + \Psi) \tag{12.17}$$

The conditional posterior for $z_i$ is given by

$$p(z_i|x_i, q_i = c, \theta) = \mathcal{N}(z_i|m_{ic}, \Sigma_{ic}) \tag{12.18}$$

$$\Sigma_{ic} \triangleq (I_L + W_c^T \Psi^{-1} W_c)^{-1} \tag{12.19}$$

$$m_{ic} \triangleq \Sigma_{ic} (W_c^T \Psi^{-1} (x_i - \mu_c)) \tag{12.20}$$

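In the M step, it is easiest to estimate $\mu_c$ and $W_c$ at the same time, by defining
$\tilde{W}_c = (W_c, \mu_c)$, $\tilde{z} = (z, 1)$. Also, define

$$b_{ic} \triangleq E[\tilde{z}|x_i, q_i = c] = [m_{ic}; 1] \tag{12.21}$$

$$C_{ic} \triangleq E[\tilde{z} \tilde{z}^T|x_i, q_i = c] = \begin{pmatrix} E[z z^T|x_i, q_i = c] & E[z|x_i, q_i = c] \\ E[z|x_i, q_i = c]^T & 1 \end{pmatrix} \tag{12.22}$$

Then the M step is as follows:

$$\hat{\tilde{W}}_c = \left[\sum_i r_{ic} x_i b_{ic}^T\right] \left[\sum_i r_{ic} C_{ic}\right]^{-1} \tag{12.23}$$

$$\hat{\Psi} = \frac{1}{N} \text{diag}\left\{\sum_{ic} r_{ic} (x_i - \hat{\tilde{W}}_c b_{ic}) x_i^T\right\} \tag{12.24}$$

$$\hat{\pi}_c = \frac{1}{N} \sum_{i=1}^N r_{ic} \tag{12.25}$$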

Note that these updates are for “vanilla” EM. A much faster version of this algorithm, based 
on ECM, is described in (Zhao and Yu 2008). 
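As an illustration of these updates, the following is a minimal sketch of "vanilla" EM for a single factor analyser (so $r_{ic} = 1$ and there is one cluster). It is written under several assumptions: $\Psi$ is stored as a vector `psi`, the data matrix has one case per row, the initialization is an arbitrary choice, and the function name `fa_em` is hypothetical. It also records the observed-data log likelihood at each iteration, which EM should never decrease.

```python
import numpy as np

def fa_em(X, L, n_iter=50, seed=0):
    """Fit a single factor analyser by EM. Returns W, mu, psi and the log-likelihood trace."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Augmented loading matrix [W, mu] of shape (D, L+1); arbitrary initialization.
    W_aug = np.hstack([rng.normal(size=(D, L)), X.mean(axis=0)[:, None]])
    psi = X.var(axis=0)
    lls = []
    for _ in range(n_iter):
        W, mu = W_aug[:, :L], W_aug[:, L]
        Xc = X - mu

        # Observed-data log likelihood under the current parameters.
        C = W @ W.T + np.diag(psi)
        _, logdet = np.linalg.slogdet(C)
        quad = np.sum((Xc @ np.linalg.inv(C)) * Xc)
        lls.append(-0.5 * (N * (D * np.log(2 * np.pi) + logdet) + quad))

        # E step (Equations 12.19 and 12.20 with a single cluster).
        Wt_Psi_inv = W.T / psi
        Sigma = np.linalg.inv(np.eye(L) + Wt_Psi_inv @ W)
        M = Xc @ Wt_Psi_inv.T @ Sigma                 # rows are the posterior means m_i
        B = np.hstack([M, np.ones((N, 1))])           # rows are b_i = [m_i; 1]
        C_sum = B.T @ B                               # sum_i b_i b_i^T
        C_sum[:L, :L] += N * Sigma                    # add posterior covariance to the z z^T block

        # M step (Equations 12.23 and 12.24 with r_i = 1).
        W_aug = (X.T @ B) @ np.linalg.inv(C_sum)
        psi = np.mean((X - B @ W_aug.T) * X, axis=0)
    return W_aug[:, :L], W_aug[:, L], psi, np.array(lls)

# Synthetic data from a hypothetical 2-factor model in 5 dimensions.
rng = np.random.default_rng(1)
Z = rng.normal(size=(400, 2))
W_true = rng.normal(size=(5, 2))
X = Z @ W_true.T + 0.3 * rng.normal(size=(400, 5)) + 1.0

W_hat, mu_hat, psi_hat, lls = fa_em(X, L=2, n_iter=30)
print("log likelihood: first", lls[0], "last", lls[-1])
```

Note that `W_hat` is only identified up to a rotation, so it should be compared with `W_true` through `W_hat @ W_hat.T` rather than entry by entry.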


12.1.6 Fitting FA models with missing data


In many applications, such as collaborative filtering, we have missing data. One virtue of the 
EM approach to fitting an FA/PPCA model is that it is easy to extend to this case. However, 
overfitting can be a problem if there is a lot of missing data. Consequently it is important to 
perform MAP estimation or to use Bayesian inference. See e.g., (Ilin and Raiko 2010) for details. 


12.2 Principal components analysis (PCA)


Consider the FA model where we constrain $\Psi = \sigma^2 I$, and $W$ to be orthonormal. It can
be shown (Tipping and Bishop 1999) that, as $\sigma^2 \to 0$, this model reduces to classical
(non-probabilistic) principal components analysis (PCA), also known as the Karhunen-Loève
transform. The version where $\sigma^2 > 0$ is known as probabilistic PCA (PPCA) (Tipping and
Bishop 1999), or sensible PCA (Roweis 1997). (An equivalent result was derived independently,
from a different perspective, in (Moghaddam and Pentland 1995).)

To make sense of this result, we first have to learn about classical PCA. We then connect PCA 
to the SVD. And finally we return to discuss PPCA. 


12.2.1 Classical PCA: statement of the theorem
The synthesis view of classical PCA is summarized in the following theorem.


Theorem 12.2.1. Suppose we want to find an orthogonal set of $L$ linear basis vectors $w_j \in \mathbb{R}^D$,
and the corresponding scores $z_i \in \mathbb{R}^L$, such that we minimize the average reconstruction error

$$J(W, Z) = \frac{1}{N} \sum_{i=1}^N \|x_i - \hat{x}_i\|^2 \tag{12.26}$$
Figure 12.5 An illustration of PCA and PPCA where D = 2 and L = 1. Circles are the original data
points, crosses are the reconstructions. The red star is the data mean. (a) PCA. The points are orthogonally 
projected onto the line. Figure generated by pcaDemo2d. (b) PPCA. The projection is no longer orthogonal: 
the reconstructions are shrunk towards the data mean (red star). Based on Figure 7.6 of (Nabney 2001). 
Figure generated by ppcaDemo2d. 


where $\hat{x}_i = W z_i$, subject to the constraint that $W$ is orthonormal. Equivalently, we can write this
objective as follows:

$$J(W, Z) = \frac{1}{N} \|X - Z W^T\|_F^2 \tag{12.27}$$

where $X$ is the $N \times D$ data matrix with the $x_i$ in its rows, $Z$ is an $N \times L$ matrix with the $z_i$ in its rows,
and $\|A\|_F$ is the Frobenius norm of matrix $A$, defined by

$$\|A\|_F = \sqrt{\sum_i \sum_j a_{ij}^2} = \sqrt{\text{tr}(A^T A)} = \|A(:)\|_2 \tag{12.28}$$

The optimal solution is obtained by setting $\hat{W} = V_L$, where $V_L$ contains the $L$ eigenvectors
with largest eigenvalues of the empirical covariance matrix, $\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^N x_i x_i^T$. (We assume the
$x_i$ have zero mean, for notational simplicity.) Furthermore, the optimal low-dimensional encoding
of the data is given by $\hat{z}_i = W^T x_i$, which is an orthogonal projection of the data onto the column
space spanned by the eigenvectors.
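The theorem translates almost line for line into code. The sketch below is one possible implementation using `numpy.linalg.eigh`; it centers the data explicitly rather than assuming zero mean, and the function name `pca_eig` is an illustrative choice. It also checks a standard consequence of the theorem: the average reconstruction error equals the sum of the discarded eigenvalues of the empirical covariance.

```python
import numpy as np

def pca_eig(X, L):
    """PCA via eigendecomposition of the empirical covariance matrix.

    Returns the basis W (D, L), the scores Z (N, L), the average
    reconstruction error, and all eigenvalues in decreasing order.
    """
    Xc = X - X.mean(axis=0)                      # center the data
    S = Xc.T @ Xc / Xc.shape[0]                  # empirical covariance
    evals, evecs = np.linalg.eigh(S)             # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]              # reorder to descending
    evals, evecs = evals[order], evecs[:, order]
    W = evecs[:, :L]                             # top-L eigenvectors
    Z = Xc @ W                                   # z_i = W^T x_i (orthogonal projection)
    X_hat = Z @ W.T                              # reconstructions
    err = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
    return W, Z, err, evals

# Correlated synthetic data (hypothetical example).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
W, Z, err, evals = pca_eig(X, L=2)
print("reconstruction error:       ", err)
print("sum of discarded eigenvalues:", evals[2:].sum())
```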


An example of this is shown in Figure 12.5(a) for D = 2 and L = 1. The diagonal line is the
vector $w_1$; this is called the first principal component or principal direction. The data points
$x_i \in \mathbb{R}^2$ are orthogonally projected onto this line to get $z_i \in \mathbb{R}$. This is the best 1-dimensional
approximation to the data. (We will discuss Figure 12.5(b) later.)

In general, it is hard to visualize higher dimensional data, but if the data happens to be a 
set of images, it is easy to do so. Figure 12.6 shows the first three principal vectors, reshaped 
as images, as well as the reconstruction of a specific image using a varying number of basis 
vectors. (We discuss how to choose L in Section 11.5.) 

Below we will show that the principal directions are the ones along which the data shows 
maximal variance. This means that PCA can be “misled” by directions in which the variance 
is high merely because of the measurement scale. Figure 12.7(a) shows an example, where the 
vertical axis (weight) uses a larger range than the horizontal axis (height), resulting in a line that
looks somewhat “unnatural”. It is therefore standard practice to standardize the data first, or 


Figure 12.6 (a) The mean and the first three PC basis vectors (eigendigits) based on 25 images of the digit 
3 (from the MNIST dataset). (b) Reconstruction of an image based on 2, 10, 100 and all the basis vectors. 
Figure generated by pcaImageDemo. 
Figure 12.7 Effect of standardization on PCA applied to the height/weight dataset. Left: PCA of raw data.
Right: PCA of standardized data. Figure generated by pcaDemoHeightWeight. 


equivalently, to work with correlation matrices instead of covariance matrices. The benefits of 
this are apparent from Figure 12.7(b). 


12.2.2 Proof * 


Proof. We use $w_j \in \mathbb{R}^D$ to denote the j'th principal direction, $x_i \in \mathbb{R}^D$ to denote the i'th
high-dimensional observation, $z_i \in \mathbb{R}^L$ to denote the i'th low-dimensional representation, and
$\tilde{z}_j \in \mathbb{R}^N$ to denote the vector $[z_{1j}, \ldots, z_{Nj}]$, which is the j'th component of all the low-dimensional
vectors.

Let us start by estimating the best 1d solution, $w_1 \in \mathbb{R}^D$, and the corresponding projected
points $\tilde{z}_1 \in \mathbb{R}^N$. We will find the remaining bases $w_2$, $w_3$, etc. later. The reconstruction error
is given by

$$J(w_1, \tilde{z}_1) = \frac{1}{N} \sum_{i=1}^N \|x_i - z_{i1} w_1\|^2 = \frac{1}{N} \sum_{i=1}^N (x_i - z_{i1} w_1)^T (x_i - z_{i1} w_1) \tag{12.29}$$

$$= \frac{1}{N} \sum_{i=1}^N \left[x_i^T x_i - 2 z_{i1} w_1^T x_i + z_{i1}^2 w_1^T w_1\right] \tag{12.30}$$

$$= \frac{1}{N} \sum_{i=1}^N \left[x_i^T x_i - 2 z_{i1} w_1^T x_i + z_{i1}^2\right] \tag{12.31}$$

since $w_1^T w_1 = 1$ (by the orthonormality assumption). Taking derivatives wrt $z_{i1}$ and equating
to zero gives

$$\frac{\partial}{\partial z_{i1}} J(w_1, \tilde{z}_1) = \frac{1}{N} \left[-2 w_1^T x_i + 2 z_{i1}\right] = 0 \Rightarrow z_{i1} = w_1^T x_i \tag{12.32}$$

So the optimal reconstruction weights are obtained by orthogonally projecting the data onto the
first principal direction, $w_1$ (see Figure 12.5(a)). Plugging back in gives

$$J(w_1) = \frac{1}{N} \sum_{i=1}^N \left[x_i^T x_i - z_{i1}^2\right] = \text{const} - \frac{1}{N} \sum_{i=1}^N z_{i1}^2 \tag{12.33}$$

Now the variance of the projected coordinates is given by

$$\text{var}[\tilde{z}_1] = E[\tilde{z}_1^2] - (E[\tilde{z}_1])^2 = \frac{1}{N} \sum_{i=1}^N z_{i1}^2 - 0 \tag{12.34}$$

since

$$E[z_{i1}] = E[x_i^T w_1] = E[x_i]^T w_1 = 0 \tag{12.35}$$

because the data has been centered. From this, we see that minimizing the reconstruction error
is equivalent to maximizing the variance of the projected data, i.e.,

$$\arg\min_{w_1} J(w_1) = \arg\max_{w_1} \text{var}[\tilde{z}_1] \tag{12.36}$$

This is why it is often said that PCA finds the directions of maximal variance. This is called the 
analysis view of PCA. 
The variance of the projected data can be written as

$$\frac{1}{N} \sum_{i=1}^N z_{i1}^2 = \frac{1}{N} \sum_{i=1}^N w_1^T x_i x_i^T w_1 = w_1^T \hat{\Sigma} w_1 \tag{12.37}$$

where $\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^N x_i x_i^T$ is the empirical covariance matrix (or correlation matrix if the
data is standardized).

We can trivially maximize the variance of the projection (and hence minimize the reconstruction
error) by letting $\|w_1\| \to \infty$, so we impose the constraint $\|w_1\| = 1$ and instead
maximize

$$\tilde{J}(w_1) = w_1^T \hat{\Sigma} w_1 + \lambda_1 (1 - w_1^T w_1) \tag{12.38}$$

where $\lambda_1$ is the Lagrange multiplier. Taking derivatives and equating to zero we have

$$\frac{\partial}{\partial w_1} \tilde{J}(w_1) = 2 \hat{\Sigma} w_1 - 2 \lambda_1 w_1 = 0 \tag{12.39}$$

$$\hat{\Sigma} w_1 = \lambda_1 w_1 \tag{12.40}$$

Hence the direction that maximizes the variance is an eigenvector of the covariance matrix. Left
multiplying by $w_1^T$ (and using $w_1^T w_1 = 1$) we find that the variance of the projected data is

$$w_1^T \hat{\Sigma} w_1 = \lambda_1 \tag{12.41}$$


Since we want to maximize the variance, we pick the eigenvector which corresponds to the 
largest eigenvalue. 
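This step of the argument can be illustrated numerically. The sketch below (with arbitrary synthetic data and illustrative variable names) checks that the variance of the data projected onto the top eigenvector equals the largest eigenvalue, and that no randomly drawn unit direction achieves a larger projected variance.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 4
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
Xc = X - X.mean(axis=0)                          # centered data
S = Xc.T @ Xc / N                                # empirical covariance

evals, evecs = np.linalg.eigh(S)                 # ascending eigenvalues
w1, lam1 = evecs[:, -1], evals[-1]               # top eigenvector and eigenvalue

# Variance of the projection onto w1 (population variance, ddof=0).
var_w1 = np.var(Xc @ w1)

# Projected variance along many random unit directions, for comparison.
U = rng.normal(size=(1000, D))
U /= np.linalg.norm(U, axis=1, keepdims=True)
var_rand = np.var(Xc @ U.T, axis=0)

print("variance along w1:", var_w1, " largest eigenvalue:", lam1)
print("best random direction:", var_rand.max())
```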
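Now let us find another direction $w_2$ to further minimize the reconstruction error, subject to
$w_1^T w_2 = 0$ and $w_2^T w_2 = 1$. The error is

$$J(w_1, \tilde{z}_1, w_2, \tilde{z}_2) = \frac{1}{N} \sum_{i=1}^N \|x_i - z_{i1} w_1 - z_{i2} w_2\|^2 \tag{12.42}$$

Optimizing wrt $w_1$ and $\tilde{z}_1$ gives the same solution as before. Exercise 12.4 asks you to show
that $\frac{\partial J}{\partial \tilde{z}_2} = 0$ yields $z_{i2} = w_2^T x_i$. In other words, the second principal encoding is obtained by
projecting onto the second principal direction. Substituting in yields

$$J(w_2) = \frac{1}{N} \sum_{i=1}^N \left[x_i^T x_i - w_1^T x_i x_i^T w_1 - w_2^T x_i x_i^T w_2\right] = \text{const} - w_2^T \hat{\Sigma} w_2 \tag{12.43}$$

Dropping the constant term and adding the constraints yields

$$\tilde{J}(w_2) = -w_2^T \hat{\Sigma} w_2 + \lambda_2 (w_2^T w_2 - 1) + \lambda_{12} (w_2^T w_1 - 0) \tag{12.44}$$

Exercise 12.4 asks you to show that the solution is given by the eigenvector with the second
largest eigenvalue:

$$\hat{\Sigma} w_2 = \lambda_2 w_2 \tag{12.45}$$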


The proof continues in this way. (Formally one can use induction.) 


12.2.3 Singular value decomposition (SVD)


We have defined the solution to PCA in terms of eigenvectors of the covariance matrix. However, 
there is another way to obtain the solution, based on the singular value decomposition, or 
SVD. This basically generalizes the notion of eigenvectors from square matrices to any kind of 
matrix. 
In particular, any (real) $N \times D$ matrix $X$ can be decomposed as follows

$$\underbrace{X}_{N \times D} = \underbrace{U}_{N \times N} \underbrace{S}_{N \times D} \underbrace{V^T}_{D \times D} \tag{12.46}$$

where $U$ is an $N \times N$ matrix whose columns are orthonormal (so $U^T U = I_N$), $V$ is a $D \times D$
matrix whose rows and columns are orthonormal (so $V^T V = V V^T = I_D$), and $S$ is an $N \times D$
matrix containing the $r = \min(N, D)$ singular values $\sigma_i \geq 0$ on the main diagonal, with 0s
filling the rest of the matrix. The columns of $U$ are the left singular vectors, and the columns
of $V$ are the right singular vectors. See Figure 12.8(a) for an example.

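Since there are at most $D$ singular values (assuming $N > D$), the last $N - D$ columns of $U$
are irrelevant, since they will be multiplied by 0. The economy sized SVD, or thin SVD, avoids
computing these unnecessary elements. Let us denote this decomposition by $\hat{U} \hat{S} \hat{V}^T$. If $N > D$,
we have

$$\underbrace{X}_{N \times D} = \underbrace{\hat{U}}_{N \times D} \underbrace{\hat{S}}_{D \times D} \underbrace{\hat{V}^T}_{D \times D} \tag{12.47}$$

as in Figure 12.8(a). If $N < D$, we have

$$\underbrace{X}_{N \times D} = \underbrace{\hat{U}}_{N \times N} \underbrace{\hat{S}}_{N \times N} \underbrace{\hat{V}^T}_{N \times D} \tag{12.48}$$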


Computing the economy-sized SVD takes O(N D min(N, D)) time (Golub and van Loan 1996, 
p254). 
The connection between eigenvectors and singular vectors is the following. For an arbitrary
real matrix $X$, if $X = U S V^T$, we have

$$X^T X = V S^T U^T U S V^T = V (S^T S) V^T = V D V^T \tag{12.49}$$

where $D = S^2$ is a diagonal matrix containing the squared singular values. Hence

$$(X^T X) V = V D \tag{12.50}$$

so the eigenvectors of $X^T X$ are equal to $V$, the right singular vectors of $X$, and the eigenvalues
of $X^T X$ are equal to $D$, the squared singular values. Similarly

$$X X^T = U S V^T V S^T U^T = U (S S^T) U^T \tag{12.51}$$

$$(X X^T) U = U (S S^T) = U D \tag{12.52}$$

so the eigenvectors of $X X^T$ are equal to $U$, the left singular vectors of $X$. Also, the eigenvalues
of $X X^T$ are equal to the squared singular values. We can summarize all this as follows:

$$U = \text{evec}(X X^T), \quad V = \text{evec}(X^T X), \quad S^2 = \text{eval}(X X^T) = \text{eval}(X^T X) \tag{12.53}$$
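The relationships in Equation 12.53 can be verified with a few lines of NumPy. The sketch below uses the economy-sized SVD (`full_matrices=False`) on a small random matrix; the sizes and variable names are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 3
X = rng.normal(size=(N, D))

# Economy-sized SVD: U is (N, D), s holds the D singular values, Vt is (D, D).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Eigenvalues of X^T X, reordered to be descending like the singular values.
evals = np.linalg.eigvalsh(X.T @ X)[::-1]

print("squared singular values:", s ** 2)
print("eigenvalues of X^T X:   ", evals)

# Columns of V are eigenvectors of X^T X; columns of U are eigenvectors of X X^T.
print("(X^T X) V = V D:", np.allclose(X.T @ X @ V, V * s ** 2))
print("(X X^T) U = U D:", np.allclose(X @ X.T @ U, U * s ** 2))
```

Eigenvectors returned by an eigensolver may differ from the singular vectors by a sign, which is why the check is phrased through the eigenvector equations rather than by comparing vectors directly.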
Figure 12.8 (a) SVD decomposition of non-square matrices $X = U S V^T$. The shaded parts of S, and all
the off-diagonal terms, are zero. The shaded entries in U and S are not computed in the economy-sized 
version, since they are not needed. (b) Truncated SVD approximation of rank L. 


Since the eigenvectors are unaffected by linear scaling of a matrix, we see that the right
singular vectors of $X$ are equal to the eigenvectors of the empirical covariance $\hat{\Sigma}$. Furthermore,
the eigenvalues of $\hat{\Sigma}$ are a scaled version of the squared singular values. This means we can
perform PCA using just a few lines of code (see pcaPmtk).

However, the connection between PCA and SVD goes deeper. From Equation 12.46, we can 
represent a rank r matrix as follows: 


$$X = \sigma_1 u_1 v_1^T + \cdots + \sigma_r u_r v_r^T \tag{12.54}$$


If the singular values die off quickly as in Figure 12.10, we can produce a rank L approximation 
to the matrix as follows: 


$$X \approx U_{:,1:L} \, S_{1:L,1:L} \, V_{:,1:L}^T \tag{12.55}$$


This is called a truncated SVD (see Figure 12.8(b)). The total number of parameters needed to 
represent an N x D matrix using a rank L approximation is 


$$NL + LD + L = L(N + D + 1) \tag{12.56}$$

Figure 12.9 Low rank approximations to an image. Top left: The original image is of size 200 × 320, so
has rank 200. Subsequent images have ranks 2, 5, and 20. Figure generated by svdImageDemo.

Figure 12.10 First 50 log singular values for the clown image (solid red line), and for a data matrix
obtained by randomly shuffling the pixels (dotted green line). Figure generated by svdImageDemo.

As an example, consider the 200 × 320 pixel image in Figure 12.9 (top left). This has 64,000
numbers in it. We see that a rank 20 approximation, with only (200 + 320 + 1) × 20 = 10,420
numbers, is a very good approximation.

One can show that the error in this approximation is given by 


$\|\mathbf{X} - \mathbf{X}_L\|_F \approx \sigma_{L+1}$ (12.57)


Furthermore, one can show that the SVD offers the best rank L approximation to a matrix (best 
in the sense of minimizing the above Frobenius norm). 
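
As a rough illustration of the truncated SVD, the sketch below builds a rank L approximation of a synthetic 200 x 320 matrix, counts the parameters using Equation 12.56, and measures the approximation error. The helper name `truncated_svd` and the random data are assumptions. The code relies on two standard facts: the Frobenius error equals the square root of the sum of the squared discarded singular values (of which $\sigma_{L+1}$ is the largest contribution), and the spectral-norm error equals $\sigma_{L+1}$ exactly.

```python
import numpy as np

def truncated_svd(X, L):
    """Return the rank-L approximation U[:, :L] S[:L, :L] V[:, :L]^T and all singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_L = (U[:, :L] * s[:L]) @ Vt[:L]             # scale columns of U, then project back
    return X_L, s

rng = np.random.default_rng(0)
N, D, L = 200, 320, 20
# Synthetic matrix with fast-decaying structure plus a little noise (assumption).
X = rng.normal(size=(N, 30)) @ rng.normal(size=(30, D)) + 0.01 * rng.normal(size=(N, D))
X_L, s = truncated_svd(X, L)

n_params = L * (N + D + 1)                        # Equation 12.56
frob_err = np.linalg.norm(X - X_L, "fro")
tail = np.sqrt(np.sum(s[L:] ** 2))                # root of summed squared discarded singular values
spec_err = np.linalg.norm(X - X_L, 2)             # largest singular value of the residual
```

With N = 200, D = 320 and L = 20, `n_params` reproduces the count of 10,420 quoted in the text.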

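Let us connect this back to PCA. Let $\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$ be a truncated SVD of $\mathbf{X}$. We know that
$\hat{\mathbf{W}} = \mathbf{V}$, and that $\hat{\mathbf{Z}} = \mathbf{X}\hat{\mathbf{W}}$, so

$\hat{\mathbf{Z}} = \mathbf{U}\mathbf{S}\mathbf{V}^T\mathbf{V} = \mathbf{U}\mathbf{S}$ (12.58)

Furthermore, the optimal reconstruction is given by $\hat{\mathbf{X}} = \hat{\mathbf{Z}}\hat{\mathbf{W}}^T$, so we find

$\hat{\mathbf{X}} = \mathbf{U}\mathbf{S}\mathbf{V}^T$ (12.59)

This is precisely the same as a truncated SVD approximation! This is another illustration of the
fact that PCA is the best low rank approximation to the data.


Probabilistic PCA


We are now ready to revisit PPCA. One can show the following remarkable result.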


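Theorem 12.2.2 ((Tipping and Bishop 1999)). Consider a factor analysis model in which $\boldsymbol{\Psi} = \sigma^2\mathbf{I}$
and $\mathbf{W}$ is orthogonal. The observed data log likelihood is given by

$\log p(\mathbf{X}|\mathbf{W}, \sigma^2) = -\frac{N}{2}\ln|\mathbf{C}| - \frac{1}{2}\sum_{i=1}^{N}\mathbf{x}_i^T\mathbf{C}^{-1}\mathbf{x}_i = -\frac{N}{2}\left[\ln|\mathbf{C}| + \mathrm{tr}(\mathbf{C}^{-1}\mathbf{S})\right]$ (12.60)

where $\mathbf{C} = \mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I}$ and $\mathbf{S} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T = (1/N)\mathbf{X}^T\mathbf{X}$. (We are assuming centered
data, for notational simplicity, and have dropped the additive constant $-\frac{ND}{2}\ln(2\pi)$.) The maxima of the log-likelihood are given by

$\hat{\mathbf{W}} = \mathbf{V}(\boldsymbol{\Lambda} - \sigma^2\mathbf{I})^{1/2}\mathbf{R}$ (12.61)

where $\mathbf{R}$ is an arbitrary L x L orthogonal matrix, $\mathbf{V}$ is the D x L matrix whose columns are the
first L eigenvectors of $\mathbf{S}$, and $\boldsymbol{\Lambda}$ is the corresponding diagonal matrix of eigenvalues. Without loss
of generality, we can set $\mathbf{R} = \mathbf{I}$. Furthermore, the MLE of the noise variance is given by

$\hat{\sigma}^2 = \frac{1}{D - L}\sum_{j=L+1}^{D}\lambda_j$ (12.62)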

which is the average variance associated with the discarded dimensions. 
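
The closed-form solution in Equations 12.61 and 12.62 is short enough to sketch directly. The code below is an illustration under stated assumptions: the data are synthetic and already centered, R is set to the identity, and the function name `ppca_mle` is hypothetical. It then forms the model covariance $\mathbf{C} = \mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I}$, whose leading L eigenvalues should match those of $\mathbf{S}$, with the remaining ones equal to $\hat{\sigma}^2$.

```python
import numpy as np

def ppca_mle(X, L):
    """Closed-form PPCA MLE with R = I (Equations 12.61 and 12.62), for centered X."""
    N, D = X.shape
    S = X.T @ X / N                               # empirical covariance of centered data
    evals, evecs = np.linalg.eigh(S)              # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]               # reorder to descending
    evals, evecs = evals[order], evecs[:, order]
    sigma2 = evals[L:].mean()                     # average of the discarded eigenvalues
    W = evecs[:, :L] * np.sqrt(evals[:L] - sigma2)
    return W, sigma2, evals

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6)) * np.array([3.0, 2.0, 1.0, 0.5, 0.5, 0.5])  # synthetic (assumption)
X = X - X.mean(axis=0)
W, sigma2, evals = ppca_mle(X, L=2)

C = W @ W.T + sigma2 * np.eye(6)                  # implied model covariance
C_evals = np.sort(np.linalg.eigvalsh(C))[::-1]
```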


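Thus, as $\sigma^2 \to 0$, we have $\hat{\mathbf{W}} \to \mathbf{V}$, as in classical PCA. What about $\hat{\mathbf{Z}}$? It is easy to see that
the posterior over the latent factors is given by

$p(\mathbf{z}_i|\mathbf{x}_i, \hat{\boldsymbol{\theta}}) = \mathcal{N}(\mathbf{z}_i|\hat{\mathbf{F}}^{-1}\hat{\mathbf{W}}^T\mathbf{x}_i, \sigma^2\hat{\mathbf{F}}^{-1})$ (12.63)
$\hat{\mathbf{F}} \triangleq \hat{\mathbf{W}}^T\hat{\mathbf{W}} + \hat{\sigma}^2\mathbf{I}$ (12.64)

(Do not confuse $\mathbf{F} = \mathbf{W}^T\mathbf{W} + \sigma^2\mathbf{I}$ with $\mathbf{C} = \mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I}$.) Hence, as $\sigma^2 \to 0$, we find
$\hat{\mathbf{W}} \to \mathbf{V}$, $\hat{\mathbf{F}} \to \mathbf{I}$ and $\hat{\mathbf{z}}_i \to \mathbf{V}^T\mathbf{x}_i$. Thus the posterior mean is obtained by an orthogonal
projection of the data onto the column space of $\mathbf{V}$, as in classical PCA.

Note, however, that if $\sigma^2 > 0$, the posterior mean is not an orthogonal projection, since it is
shrunk somewhat towards the prior mean, as illustrated in Figure 12.5(b). This sounds like an
undesirable property, but it means that the reconstructions will be closer to the overall data
mean, $\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}$.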


EM algorithm for PCA 


Although the usual way to fit a PCA model uses eigenvector methods, or the SVD, we can also 
use EM, which will turn out to have some advantages that we discuss below. EM for PCA relies 
on the probabilistic formulation of PCA. However the algorithm continues to work in the zero 
noise limit, $\sigma^2 = 0$, as shown by (Roweis 1997).

Let $\tilde{\mathbf{Z}}$ be an L x N matrix storing the posterior means (low-dimensional representations)
along its columns. Similarly, let $\tilde{\mathbf{X}} = \mathbf{X}^T$ store the original data along its columns. From
Equation 12.63, when $\sigma^2 = 0$, we have

$\tilde{\mathbf{Z}} = (\mathbf{W}^T\mathbf{W})^{-1}\mathbf{W}^T\tilde{\mathbf{X}}$ (12.65)


This constitutes the E step. Notice that this is just an orthogonal projection of the data. 
From Equation 12.23, the M step is given by 


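$\mathbf{W} = \left[\sum_i \mathbf{x}_i\,\mathbb{E}[\mathbf{z}_i]^T\right]\left[\sum_i \mathbb{E}[\mathbf{z}_i]\,\mathbb{E}[\mathbf{z}_i]^T\right]^{-1}$ (12.66)

where we exploited the fact that $\boldsymbol{\Sigma} = \mathrm{cov}[\mathbf{z}_i|\mathbf{x}_i, \boldsymbol{\theta}] = 0\mathbf{I}$ when $\sigma^2 = 0$. It is worth comparing
this expression to the MLE for multi-output linear regression (Equation 7.89), which has the form
$\mathbf{W} = (\sum_i \mathbf{y}_i\mathbf{x}_i^T)(\sum_i \mathbf{x}_i\mathbf{x}_i^T)^{-1}$. Thus we see that the M step is like linear regression where we
replace the observed inputs by the expected values of the latent variables.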

In summary, here is the entire algorithm: 


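• E step: $\tilde{\mathbf{Z}} = (\mathbf{W}^T\mathbf{W})^{-1}\mathbf{W}^T\tilde{\mathbf{X}}$
• M step: $\mathbf{W} = \tilde{\mathbf{X}}\tilde{\mathbf{Z}}^T(\tilde{\mathbf{Z}}\tilde{\mathbf{Z}}^T)^{-1}$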


(Tipping and Bishop 1999) showed that the only stable fixed point of the EM algorithm is the 
globally optimal solution. That is, the EM algorithm converges to a solution where W spans 
the same linear subspace as that defined by the first L eigenvectors. However, if we want W 
to be orthogonal, and to contain the eigenvectors in descending order of eigenvalue, we have 
to orthogonalize the resulting matrix (which can be done quite cheaply). Alternatively, we can 
modify EM to give the principal basis directly (Ahn and Oh 2003). 
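
The two-step algorithm above can be sketched in a few lines of NumPy. This is an illustrative implementation under assumptions, not the code used to generate the book's figures: the function name `em_pca`, the iteration count, the random initialization and the synthetic data are all choices made here. After the iterations, a QR decomposition gives an orthonormal basis for the span of W, which is compared with the subspace found by the SVD.

```python
import numpy as np

def em_pca(X, L, n_iter=100, seed=0):
    """EM for PCA in the zero-noise limit. X is N x D and assumed centered."""
    rng = np.random.default_rng(seed)
    Xt = X.T                                       # D x N, data stored along columns
    W = rng.normal(size=(X.shape[1], L))           # random initial guess
    for _ in range(n_iter):
        Z = np.linalg.solve(W.T @ W, W.T @ Xt)     # E step: (W^T W)^{-1} W^T X~
        W = np.linalg.solve(Z @ Z.T, Z @ Xt.T).T   # M step: X~ Z~^T (Z~ Z~^T)^{-1}
    Q, _ = np.linalg.qr(W)                         # orthonormal basis for span(W)
    return Q

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2])  # synthetic (assumption)
X = X - X.mean(axis=0)
Q = em_pca(X, L=2)

# Compare subspaces through their projection matrices, which are basis independent.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:2].T
gap = np.linalg.norm(Q @ Q.T - V @ V.T)
```

Comparing projection matrices rather than the bases themselves avoids having to resolve the arbitrary rotation within the subspace; `gap` should be close to zero once EM has converged.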

This algorithm has a simple physical analogy in the case D = 2 and L = 1 (Roweis 1997).
Consider some points in $\mathbb{R}^2$ attached by springs to a rigid rod, whose orientation is defined by a
vector $\mathbf{w}$. Let $z_i$ be the location where the i'th spring attaches to the rod. In the E step, we hold
the rod fixed, and let the attachment points slide around so as to minimize the spring energy
(which is proportional to the sum of squared residuals). In the M step, we hold the attachment
points fixed and let the rod rotate so as to minimize the spring energy. See Figure 12.11 for an
illustration.

Figure 12.11 Illustration of EM for PCA when D = 2 and L = 1. Green stars are the original data points,
black circles are their reconstructions. The weight vector w is represented by the blue line. (a) We start with
a random initial guess of w. The E step is represented by the orthogonal projections. (b) We update the
rod w in the M step, keeping the projections onto the rod (black circles) fixed. (c) Another E step. The
black circles can 'slide' along the rod, but the rod stays fixed. (d) Another M step. Based on Figure 12.12 of
(Bishop 2006b). Figure generated by pcaEmStepByStep.

Apart from this pleasing intuitive interpretation, EM for PCA has the following advantages 
over eigenvector methods: 


• EM can be faster. In particular, assuming N, D ≫ L, the dominant cost of EM is the projection
operation in the E step, so the overall time is O(TLND), where T is the number of
iterations. (Roweis 1997) showed experimentally that the number of iterations is usually very
small (the mean was 3.6), regardless of N or D. (This result depends on the ratio of eigenvalues
of the empirical covariance matrix.) This is much faster than the O(min(ND², DN²))
time required by straightforward eigenvector methods, although more sophisticated eigenvector
methods, such as the Lanczos algorithm, have running times comparable to EM.

• EM can be implemented in an online fashion, i.e., we can update our estimate of W as the
data streams in.

• EM can handle missing data in a simple way (see Section 12.1.6).

• EM can be extended to handle mixtures of PPCA/FA models.

• EM can be modified to variational EM or to variational Bayes EM to fit more complex models.

Figure 12.12 Illustration of estimating the effective dimensionalities in a mixture of factor analysers using
VBEM. The blank columns have been forced to 0 via the ARD mechanism. The data was generated from
6 clusters with intrinsic dimensionalities of 7,4,3,2,2,1, which the method has successfully estimated.
Source: Figure 4.4 of (Beal 2003). Used with kind permission of Matt Beal.


Choosing the number of latent dimensions 


In Section 11.5, we discussed how to choose the number of components K in a mixture model. 
In this section, we discuss how to choose the number of latent dimensions L in a FA/PCA model. 


Model selection for FA/PPCA 


If we use a probabilistic model, we can in principle compute $L^* = \arg\max_L p(L|\mathcal{D})$. However,
there are two problems with this. First, evaluating the marginal likelihood for LVMs is quite 
difficult. In practice, simple approximations, such as BIC or variational lower bounds (see 
Section 21.5), can be used (see also (Minka 2000a)). Alternatively, we can use the cross-validated 
likelihood as a performance measure, although this can be slow, since it requires fitting each 
model F times, where F is the number of CV folds. 

The second issue is the need to search over a potentially large number of models. The usual
approach is to perform exhaustive search over all candidate values of L. However, sometimes
we can set the model to its maximal size, and then use a technique called automatic relevancy
determination (Section 13.7), combined with EM, to automatically prune out irrelevant weights.
This technique will be described in a supervised context in Chapter 13, but can be adapted to
the (M)FA context as shown in (Bishop 1999; Ghahramani and Beal 2000).

[Table body not recovered. Column headers: number of points per cluster; intrinsic dimensionalities 1, 7, 4, 3, 2, 2.]

Figure 12.13 We show the estimated number of clusters, and their estimated dimensionalities, as a
function of sample size. The VBEM algorithm found two different solutions when N = 8. Note that more
clusters, with larger effective dimensionalities, are discovered as the sample size increases. Source: Table
4.1 of (Beal 2003). Used with kind permission of Matt Beal.

Figure 12.12 illustrates this approach applied to a mixture of FAs fit to a small synthetic dataset. 
The figures visualize the weight matrices for each cluster, using Hinton diagrams, where
the size of the square is proportional to the value of the entry in the matrix.² We see that
many of them are sparse. Figure 12.13 shows that the degree of sparsity depends on the amount 
of training data, in accord with the Bayesian Occam’s razor. In particular, when the sample 
size is small, the method automatically prefers simpler models, but as the sample size gets 
sufficiently large, the method converges on the “correct” solution, which is one with 6 subspaces 
of dimensionality 1, 2, 2, 3, 4 and 7. 

Although the ARD/EM method is elegant, it still needs to perform search over K. This is
done using “birth” and “death” moves (Ghahramani and Beal 2000). An alternative approach is to 
perform stochastic sampling in the space of models. Traditional approaches, such as (Lopes and 
West 2004), are based on reversible jump MCMC, and also use birth and death moves. However, 
this can be slow and difficult to implement. More recent approaches use non-parametric priors, 
combined with Gibbs sampling, see e.g., (Paisley and Carin 2009). 


Model selection for PCA 


Since PCA is not a probabilistic model, we cannot use any of the methods described above. An 
obvious proxy for the likelihood is the reconstruction error: 


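$E(\mathcal{D}, L) = \frac{1}{|\mathcal{D}|}\sum_{i \in \mathcal{D}}\|\mathbf{x}_i - \hat{\mathbf{x}}_i\|^2$ (12.67)

In the case of PCA, the reconstruction is given by $\hat{\mathbf{x}}_i = \mathbf{W}\mathbf{z}_i + \boldsymbol{\mu}$, where $\mathbf{z}_i = \mathbf{W}^T(\mathbf{x}_i - \boldsymbol{\mu})$
and $\mathbf{W}$ and $\boldsymbol{\mu}$ are estimated from $\mathcal{D}_{\text{train}}$.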


2. Geoff Hinton is an English professor of computer science at the University of Toronto. 

Figure 12.14 Reconstruction error on MNIST vs number of latent dimensions used by PCA. (a) Training
set. (b) Test set. Figure generated by pcaOverfitDemo. 


Figure 12.14(a) plots $E(\mathcal{D}_{\text{train}}, L)$ vs L on the MNIST training data in Figure 12.6. We see that
it drops off quite quickly, indicating that we can capture most of the empirical correlation of the 
pixels with a small number of factors, as illustrated qualitatively in Figure 12.6. 

Exercise 12.5 asks you to prove that the residual error from only using L terms is given by the 
sum of the discarded eigenvalues: 


$E(\mathcal{D}_{\text{train}}, L) = \sum_{j=L+1}^{D}\lambda_j$ (12.68)


Therefore an alternative to plotting the error is to plot the retained eigenvalues, in decreasing 
order. This is called a scree plot, because “the plot looks like the side of a mountain, and 'scree'
refers to the debris fallen from a mountain and lying at its base”.³ This will have the same shape
as the residual error plot. 
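
The identity in Equation 12.68 is easy to check numerically. The sketch below is an illustration on synthetic data (an assumption, not MNIST): it computes the mean squared reconstruction error of Equation 12.67 for every L and compares it with the sum of the discarded eigenvalues of the empirical covariance. The sorted eigenvalues in `evals` are exactly the values one would show in a scree plot.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8)) @ rng.normal(size=(8, 8))   # synthetic training set (assumption)
mu = X.mean(axis=0)
Xc = X - mu
N, D = Xc.shape

evals, evecs = np.linalg.eigh(Xc.T @ Xc / N)
evals, evecs = evals[::-1], evecs[:, ::-1]         # decreasing order, as in a scree plot

def recon_error(L):
    """Mean squared reconstruction error E(D_train, L) of Equation 12.67."""
    W = evecs[:, :L]                               # D x L principal directions
    X_hat = (Xc @ W) @ W.T + mu                    # project, then reconstruct
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

errors = np.array([recon_error(L) for L in range(D + 1)])
tails = np.array([evals[L:].sum() for L in range(D + 1)])   # sum of discarded eigenvalues
```

On the training set, `errors` and `tails` should coincide, and both decrease monotonically to zero at L = D.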


A related quantity is the fraction of variance explained, defined as

$F(\mathcal{D}_{\text{train}}, L) = \frac{\sum_{j=1}^{L}\lambda_j}{\sum_{j'=1}^{L_{\max}}\lambda_{j'}}$ (12.69)

This captures the same information as the scree plot.

Of course, if we use L = rank(X), we get zero reconstruction error on the training set.
To avoid overfitting, it is natural to plot reconstruction error on the test set. This is shown in
Figure 12.14(b). Here we see that the error continues to go down even as the model becomes
more complex! Thus we do not get the usual U-shaped curve that we typically expect to see.

What is going on? The problem is that PCA is not a proper generative model of the data.
It is merely a compression technique. If you give it more latent dimensions, it will be able to
approximate the test data more accurately. By contrast, a probabilistic model enjoys a Bayesian
Occam’s razor effect (Section 5.3.1), in that it gets “punished” if it wastes probability mass on
parts of the space where there is little data. This is illustrated in Figure 12.15, which plots the
negative log likelihood, computed using PPCA, vs L. Here, on the test set, we see the usual
U-shaped curve.

3. Quotation from http://janda.org/workshop/factoranalysis/SPSSrun/SPSS08.htm.

Figure 12.15 Negative log likelihood on MNIST vs number of latent dimensions used by PPCA. (a) Training
set. (b) Test set. Figure generated by pcaOverfitDemo.

These results are analogous to those in Section 11.5.2, where we discussed the issue of choosing 
K in the K-means algorithm vs using a GMM. 


Profile likelihood 


Although there is no U-shape, there is sometimes a “regime change” in the plots, from relatively 
large errors to relatively small. One way to automate the detection of this is described in (Zhu 
and Ghodsi 2006). The idea is this. Let $\lambda_k$ be some measure of the error incurred by a model of
size k, such that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{L_{\max}}$. In PCA, these are the eigenvalues, but the method can
also be applied to K-means. Now consider partitioning these values into two groups, depending
on whether $k \le L$ or $k > L$, where L is some threshold which we will determine. To measure
the quality of L, we will use a simple change-point model, where $\lambda_k \sim \mathcal{N}(\mu_1, \sigma^2)$ if $k \le L$,
and $\lambda_k \sim \mathcal{N}(\mu_2, \sigma^2)$ if $k > L$. (It is important that $\sigma^2$ be the same in both models, to prevent
overfitting in the case where one regime has less data than the other.) Within each of the two
regimes, we assume the $\lambda_k$ are iid, which is obviously incorrect, but is adequate for our present
purposes. We can fit this model for each $L = 1 : L_{\max}$ by partitioning the data and computing
the MLEs, using a pooled estimate of the variance:

$\mu_1(L) = \frac{\sum_{k \le L}\lambda_k}{L}, \quad \mu_2(L) = \frac{\sum_{k > L}\lambda_k}{L_{\max} - L}$ (12.70)
$\sigma^2(L) = \frac{\sum_{k \le L}(\lambda_k - \mu_1(L))^2 + \sum_{k > L}(\lambda_k - \mu_2(L))^2}{L_{\max}}$ (12.71)

We can then evaluate the profile log likelihood

$\ell(L) = \sum_{k=1}^{L}\log\mathcal{N}(\lambda_k|\mu_1(L), \sigma^2(L)) + \sum_{k=L+1}^{L_{\max}}\log\mathcal{N}(\lambda_k|\mu_2(L), \sigma^2(L))$ (12.72)


Figure 12.16 (a) Scree plot for training set, corresponding to Figure 12.14(a). (b) Profile likelihood. Figure
generated by pcaOverfitDemo.

Finally, we choose $L^* = \arg\max_L \ell(L)$. This is illustrated in Figure 12.16. On the left, we plot
the scree plot, which has the same shape as in Figure 12.14(a). On the right, we plot the profile
likelihood. Rather miraculously, we see a fairly well-determined peak. 
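
The change-point procedure is simple enough to sketch in plain Python. The code below is an illustration under assumptions, not the pcaOverfitDemo implementation: the function name and the input values are made up, thresholds are restricted to L = 1, ..., K - 1 so that both regimes are non-empty, and the pooled variance is floored at a small constant to avoid taking the log of zero.

```python
import math

def profile_log_likelihood(lams):
    """Profile log likelihood of a two-regime Gaussian model for each threshold L = 1..K-1."""
    K = len(lams)
    out = []
    for L in range(1, K):
        g1, g2 = lams[:L], lams[L:]
        mu1, mu2 = sum(g1) / L, sum(g2) / (K - L)            # per-regime means
        ss = sum((x - mu1) ** 2 for x in g1) + sum((x - mu2) ** 2 for x in g2)
        var = max(ss / K, 1e-12)                             # pooled variance, floored
        # Sum of Gaussian log densities with the shared variance plugged in.
        out.append(-0.5 * K * math.log(2 * math.pi * var) - 0.5 * ss / var)
    return out

# Made-up, decreasing "eigenvalues" with an obvious regime change after k = 3.
lams = [9.4, 9.1, 8.7, 1.2, 1.1, 1.0, 0.9, 0.8]
ll = profile_log_likelihood(lams)
L_star = 1 + max(range(len(ll)), key=ll.__getitem__)         # thresholds start at L = 1
```

For this toy input the selected threshold is `L_star = 3`, the position of the drop between the large and small values.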


PCA for categorical data 


In this section, we consider extending the factor analysis model to the case where the observed 
data is categorical rather than real-valued. That is, the data has the form $y_{ir} \in \{1, \ldots, C\}$,
where r = 1 : R indexes the observed response variables. We assume each $y_{ir}$ is generated
from a latent variable $\mathbf{z}_i \in \mathbb{R}^L$, with a Gaussian prior, which is passed through the softmax
function as follows:

$p(\mathbf{z}_i) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ (12.73)
$p(\mathbf{y}_i|\mathbf{z}_i, \boldsymbol{\theta}) = \prod_{r=1}^{R}\mathrm{Cat}(y_{ir}|\mathcal{S}(\mathbf{W}_r^T\mathbf{z}_i + \mathbf{w}_{0r}))$ (12.74)

where $\mathbf{W}_r \in \mathbb{R}^{L \times M}$ is the factor loading matrix for response r, and $\mathbf{w}_{0r} \in \mathbb{R}^M$ is the offset
term for response r, and $\boldsymbol{\theta} = (\mathbf{W}_r, \mathbf{w}_{0r})_{r=1}^{R}$. (We need an explicit offset term, since clamping
one element of $\mathbf{z}_i$ to 1 can cause problems when computing the posterior covariance.) As in
factor analysis, we have defined the prior mean to be $\mathbf{m}_0 = \mathbf{0}$ and the prior covariance $\mathbf{V}_0 = \mathbf{I}$,
since we can capture non-zero mean by changing $\mathbf{w}_{0r}$ and non-identity covariance by changing
$\mathbf{W}_r$. We will call this categorical PCA. See Chapter 27 for a discussion of related models.

It is interesting to study what kinds of distributions we can induce on the observed variables 
by varying the parameters. For simplicity, we assume there is a single ternary response variable, 
so $y_i$ lives in the 3d probability simplex. Figure 12.17 shows what happens when we vary the
parameters of the prior, $\mathbf{m}_0$ and $\mathbf{V}_0$, which is equivalent to varying the parameters of the
likelihood, $\mathbf{W}_1$ and $\mathbf{w}_{01}$. We see that this can define fairly complex distributions over the
simplex. This induced distribution is known as the logistic normal distribution (Aitchison 
1982). 
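
To get a feel for the logistic normal distribution, one can draw Gaussian samples and push them through the softmax. The sketch below is illustrative only: it places the Gaussian directly on the softmax input (that is, it assumes an identity loading matrix and zero offset), and the function name and the particular covariance are assumptions chosen to mimic the negatively correlated case.

```python
import numpy as np

def sample_logistic_normal(m0, V0, n, seed=0):
    """Draw n points on the simplex by passing Gaussian samples through the softmax."""
    rng = np.random.default_rng(seed)
    eta = rng.multivariate_normal(m0, V0, size=n)      # n x C Gaussian draws
    eta = eta - eta.max(axis=1, keepdims=True)         # stabilise the exponentials
    p = np.exp(eta)
    return p / p.sum(axis=1, keepdims=True)            # rows sum to one

m0 = np.zeros(3)
# Strong negative correlation between the first two components (assumption).
V_neg = np.array([[1.0, -0.9, 0.0],
                  [-0.9, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
probs = sample_logistic_normal(m0, V_neg, n=2000)
corr12 = np.corrcoef(probs[:, 0], probs[:, 1])[0, 1]   # correlation between states 1 and 2
```

Changing `m0` and `V_neg` and scatter-plotting `probs` on the simplex gives pictures in the spirit of Figure 12.17.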

We can fit this model to data using a modified version of EM. The basic idea is to infer
a Gaussian approximation to the posterior $p(\mathbf{z}_i|\mathbf{y}_i, \boldsymbol{\theta})$ in the E step, and then to update $\boldsymbol{\theta}$
in the M step. The details for the multiclass case can be found in (Khan et al. 2010) (see
also Section 21.8.1). The details for the binary case for the sigmoid link can be found in
Exercise 21.9, and for the probit link in Exercise 21.10.

Figure 12.17 Some examples of the logistic normal distribution defined on the 3d simplex. (a) Diagonal
covariance and non-zero mean. (b) Negative correlation between states 1 and 2. (c) Positive correlation
between states 1 and 2. Source: Figure 1 of (Blei and Lafferty 2007). Used with kind permission of David

Figure 12.18 Left: 150 synthetic 16 dimensional bit vectors. Right: the 2d embedding learned by binary
PCA, using variational EM. We have color coded points by the identity of the true “prototype” that generated
them. Figure generated by binaryFaDemoTipping.

One application of such a model is to visualize high dimensional categorical data.
Figure 12.18(a) shows a simple example where we have 150 6-dimensional bit vectors. It is clear that
each sample is just a noisy copy of one of three binary prototypes. We fit a 2d catFA to this
data, yielding approximate MLEs $\hat{\boldsymbol{\theta}}$. In Figure 12.18(b), we plot $\mathbb{E}[\mathbf{z}_i|\mathbf{x}_i, \hat{\boldsymbol{\theta}}]$. We see that there
are three distinct clusters, as is to be expected.

In (Khan et al. 2010), we show that this model outperforms finite mixture models on the task 
of imputing missing entries in design matrices consisting of real and categorical data. This is 
useful for analysing social science survey data, which often has missing data and variables of 
mixed type. 

Figure 12.19 Gaussian latent factor models for paired data. (a) Supervised PCA. (b) Partial least squares.
(c) Canonical correlation analysis. 


PCA for paired and multi-view data 


It is common to have a pair of related datasets, e.g., gene expression and gene copy number, or 
movie ratings by users and movie reviews. It is natural to want to combine these together into a 
low-dimensional embedding. This is an example of data fusion. In some cases, we might want 
to predict one element of the pair, say $\mathbf{x}_{i1}$, from the other one, $\mathbf{x}_{i2}$, via the low-dimensional
“bottleneck”.

Below we discuss various latent Gaussian models for these tasks, following the presentation
of (Virtanen 2010). The models easily generalize from pairs to sets of data, $\mathbf{x}_{im}$, for m = 1 : M.
We focus on the case where $\mathbf{x}_{im} \in \mathbb{R}^{D_m}$. In this case, the joint distribution is multivariate
Gaussian, so we can easily fit the models using EM, or Gibbs sampling. 

We can generalize the models to handle discrete and count data by using the exponential 
family as a response distribution instead of the Gaussian, as we explain in Section 27.2.2. 
However, this will require the use of approximate inference in the E step (or an analogous 
modification to MCMC). 


Supervised PCA (latent factor regression)


Consider the following model, illustrated in Figure 12.19(a): 


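$p(\mathbf{z}_i) = \mathcal{N}(\mathbf{0}, \mathbf{I}_L)$ (12.75)
$p(y_i|\mathbf{z}_i) = \mathcal{N}(\mathbf{w}_y^T\mathbf{z}_i + \mu_y, \sigma_y^2)$ (12.76)
$p(\mathbf{x}_i|\mathbf{z}_i) = \mathcal{N}(\mathbf{W}_x\mathbf{z}_i + \boldsymbol{\mu}_x, \sigma_x^2\mathbf{I}_D)$ (12.77)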


In (Yu et al. 2006), this is called supervised PCA. In (West 2003), this is called Bayesian factor 
regression. This model is like PCA, except that the target variable y; is taken into account when 
learning the low dimensional embedding. Since the model is jointly Gaussian, we have 


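$y_i|\mathbf{x}_i \sim \mathcal{N}(\mathbf{x}_i^T\mathbf{w}, \sigma_y^2 + \mathbf{w}_y^T\mathbf{C}\mathbf{w}_y)$ (12.78)

where $\mathbf{w} = \boldsymbol{\Psi}^{-1}\mathbf{W}_x\mathbf{C}\mathbf{w}_y$, $\boldsymbol{\Psi} = \sigma_x^2\mathbf{I}_D$, and $\mathbf{C}^{-1} = \mathbf{I} + \mathbf{W}_x^T\boldsymbol{\Psi}^{-1}\mathbf{W}_x$. So although this is a
joint density model of $(y_i, \mathbf{x}_i)$, we can infer the implied conditional distribution.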
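We now show an interesting connection to Zellner’s g-prior. Suppose $p(\mathbf{w}_y) = \mathcal{N}(\mathbf{0}, \frac{1}{g}\boldsymbol{\Sigma}^2)$,
and let $\mathbf{X} = \mathbf{R}\mathbf{V}^T$ be the SVD of $\mathbf{X}$, where $\mathbf{V}^T\mathbf{V} = \mathbf{I}$ and $\mathbf{R}^T\mathbf{R} = \boldsymbol{\Sigma}^2 = \mathrm{diag}(\sigma_j^2)$ contains
the squared singular values. Then one can show (West 2003) that

$p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, g\mathbf{V}^{-T}\boldsymbol{\Sigma}^{-2}\mathbf{V}^{-1}) = \mathcal{N}(\mathbf{0}, g(\mathbf{X}^T\mathbf{X})^{-1})$ (12.79)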


So the dependence of the prior for w on X arises from the fact that w is derived indirectly by 
a joint model of X and y. 
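
The closed-form conditional in Equation 12.78 can be sanity-checked against direct conditioning of the joint Gaussian over $(\mathbf{x}_i, y_i)$. The sketch below is a numerical illustration only: the dimensions, loadings and noise variances are hypothetical values drawn at random, and the means are set to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 6, 2
W_x = rng.normal(size=(D, L))          # hypothetical loadings for x
w_y = rng.normal(size=L)               # hypothetical loadings for y
sigma_x2, sigma_y2 = 0.5, 0.3          # hypothetical noise variances

# Closed form of Equation 12.78 (all means set to zero).
Psi_inv = np.eye(D) / sigma_x2
C = np.linalg.inv(np.eye(L) + W_x.T @ Psi_inv @ W_x)
w = Psi_inv @ W_x @ C @ w_y                        # regression weights
var_closed = sigma_y2 + w_y @ C @ w_y              # predictive variance

# Direct conditioning of the joint Gaussian over (x, y).
Sigma_xx = W_x @ W_x.T + sigma_x2 * np.eye(D)      # cov[x]
Sigma_xy = W_x @ w_y                               # cov[x, y]
Sigma_yy = w_y @ w_y + sigma_y2                    # var[y]
w_direct = np.linalg.solve(Sigma_xx, Sigma_xy)
var_direct = Sigma_yy - Sigma_xy @ w_direct
```

The two routes agree because of the matrix inversion lemma; the closed form only requires inverting an L x L matrix rather than a D x D one.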

The above discussion focussed on regression. (Guo 2009) generalizes CCA to the exponential
family, which is more appropriate if $\mathbf{x}_i$ and/or $y_i$ are discrete. Although we can no longer
compute the conditional $p(y_i|\mathbf{x}_i, \boldsymbol{\theta})$ in closed form, the model has a similar interpretation to
the regression case, namely that we are predicting the response via a latent “bottleneck”.

The basic idea of compressing $\mathbf{x}_i$ to predict $y_i$ can be formulated using information theory.
In particular, we might want to find an encoding distribution $p(\mathbf{z}|\mathbf{x})$ such that we minimize

$I(X; Z) - \beta I(Z; Y)$ (12.80)

where $\beta \ge 0$ is some parameter controlling the tradeoff between compression and predictive
accuracy. This is known as the information bottleneck (Tishby et al. 1999). Often Z is taken to
be discrete, as in clustering. However, in the Gaussian case, IB is closely related to CCA (Chechik
et al. 2005).

We can easily generalize CCA to the case where $\mathbf{y}_i$ is a vector of responses to be predicted, as
in multi-label classification. (Ma et al. 2008; Williamson and Ghahramani 2008) used this model
to perform collaborative filtering, where the goal is to predict $y_{ij} \in \{1, \ldots, 5\}$, the rating person
i gives to movie j, where the “side information” $\mathbf{x}_i$ takes the form of a list of i's friends. The
intuition behind this approach is that knowledge of who your friends are, as well as the ratings
of all other users, should help predict which movies you will like. In general, any setting where
the tasks are correlated could benefit from CCA. Once we adopt a probabilistic view, various
extensions are straightforward. For example, we can easily generalize to the semi-supervised
case, where we do not observe $\mathbf{y}_i$ for all i (Yu et al. 2006).


Discriminative supervised PCA


One problem with this model is that it puts as much weight on predicting the inputs $\mathbf{x}_i$ as the
outputs $\mathbf{y}_i$. This can be partially alleviated by using a weighted objective of the following form
(Rish et al. 2008):

$\ell(\boldsymbol{\theta}) = \prod_i p(\mathbf{y}_i|\boldsymbol{\eta}_{iy})^{\alpha_y}\,p(\mathbf{x}_i|\boldsymbol{\eta}_{ix})^{\alpha_x}$ (12.81)

where the $\alpha_m$ control the relative importance of the data sources, and $\boldsymbol{\eta}_{im} = \mathbf{W}_m\mathbf{z}_i$. For
Gaussian data, we can see that $\alpha_m$ just controls the noise variance:

$\ell(\boldsymbol{\theta}) \propto \prod_i \exp\left(-\frac{1}{2}\alpha_x\|\mathbf{x}_i - \boldsymbol{\eta}_{ix}\|^2\right)\exp\left(-\frac{1}{2}\alpha_y\|\mathbf{y}_i - \boldsymbol{\eta}_{iy}\|^2\right)$ (12.82)

This interpretation holds more generally for the exponential family. Note, however, that it is hard
to estimate the $\alpha_m$ parameters, because changing them changes the normalization constant of
the likelihood. We give an alternative approach to weighting y more heavily below.


Partial least squares 


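The technique of partial least squares (PLS) (Gustafsson 2001; Sun et al. 2009) is an asymmetric
or more “discriminative” form of supervised PCA. The key idea is to allow some of the
(co)variance in the input features to be explained by its own subspace, $\mathbf{z}_i^x$, and to let the rest of
the subspace, $\mathbf{z}_i^s$, be shared between input and output. The model has the form

$p(\mathbf{z}_i) = \mathcal{N}(\mathbf{z}_i^s|\mathbf{0}, \mathbf{I}_{L_s})\,\mathcal{N}(\mathbf{z}_i^x|\mathbf{0}, \mathbf{I}_{L_x})$ (12.83)
$p(\mathbf{y}_i|\mathbf{z}_i) = \mathcal{N}(\mathbf{W}_y\mathbf{z}_i^s + \boldsymbol{\mu}_y, \sigma^2\mathbf{I}_{D_y})$ (12.84)
$p(\mathbf{x}_i|\mathbf{z}_i) = \mathcal{N}(\mathbf{W}_x\mathbf{z}_i^s + \mathbf{B}_x\mathbf{z}_i^x + \boldsymbol{\mu}_x, \sigma^2\mathbf{I}_{D_x})$ (12.85)

See Figure 12.19(b). The corresponding induced distribution on the visible variables has the form

$p(\mathbf{v}_i|\boldsymbol{\theta}) = \int \mathcal{N}(\mathbf{v}_i|\mathbf{W}\mathbf{z}_i + \boldsymbol{\mu}, \sigma^2\mathbf{I})\,\mathcal{N}(\mathbf{z}_i|\mathbf{0}, \mathbf{I})\,d\mathbf{z}_i = \mathcal{N}(\mathbf{v}_i|\boldsymbol{\mu}, \mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I})$ (12.86)

where $\mathbf{v}_i = (\mathbf{y}_i; \mathbf{x}_i)$, $\boldsymbol{\mu} = (\boldsymbol{\mu}_y; \boldsymbol{\mu}_x)$ and

$\mathbf{W} = \begin{pmatrix} \mathbf{W}_y & \mathbf{0} \\ \mathbf{W}_x & \mathbf{B}_x \end{pmatrix}$ (12.87)
$\mathbf{W}\mathbf{W}^T = \begin{pmatrix} \mathbf{W}_y\mathbf{W}_y^T & \mathbf{W}_y\mathbf{W}_x^T \\ \mathbf{W}_x\mathbf{W}_y^T & \mathbf{W}_x\mathbf{W}_x^T + \mathbf{B}_x\mathbf{B}_x^T \end{pmatrix}$ (12.88)

We should choose L large enough so that the shared subspace does not capture covariate-specific
variation.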

This model can be easily generalized to discrete data using the exponential family (Virtanen 
2010). 
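
The block structure of the induced covariance can be verified numerically. The following sketch stacks the outputs above the inputs, $\mathbf{v}_i = (\mathbf{y}_i; \mathbf{x}_i)$, matching the block layout of $\mathbf{W}$ in Equation 12.87, and compares the analytic covariance with a Monte Carlo estimate from the generative model. All sizes and parameter values are hypothetical, and the means are set to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
D_y, D_x, L_s, L_x = 2, 4, 2, 3        # hypothetical dimensions
W_y = rng.normal(size=(D_y, L_s))      # shared loadings for y
W_x = rng.normal(size=(D_x, L_s))      # shared loadings for x
B_x = rng.normal(size=(D_x, L_x))      # private loadings for x
sigma2 = 0.1

# Stack the loadings with v = (y; x) and z = (z^s; z^x).
W = np.block([[W_y, np.zeros((D_y, L_x))],
              [W_x, B_x]])
cov_model = W @ W.T + sigma2 * np.eye(D_y + D_x)

# Monte Carlo check by sampling from the generative model (zero means).
n = 200_000
z = rng.normal(size=(n, L_s + L_x))
v = z @ W.T + np.sqrt(sigma2) * rng.normal(size=(n, D_y + D_x))
cov_mc = v.T @ v / n
```

The zero block in `W` is what makes the model asymmetric: the private factors $\mathbf{z}_i^x$ contribute only to the covariance of the inputs, through $\mathbf{B}_x\mathbf{B}_x^T$.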
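

Canonical correlation analysis


Canonical correlation analysis or CCA is like a symmetric unsupervised version of PLS: it
allows each view to have its own “private” subspace, but there is also a shared subspace. If we
have two observed variables, $\mathbf{x}_i$ and $\mathbf{y}_i$, then we have three latent variables, $\mathbf{z}_i^s \in \mathbb{R}^{L_s}$ which is
shared, $\mathbf{z}_i^x \in \mathbb{R}^{L_x}$ and $\mathbf{z}_i^y \in \mathbb{R}^{L_y}$ which are private. We can write the model as follows (Bach
and Jordan 2005):

$p(\mathbf{z}_i) = \mathcal{N}(\mathbf{z}_i^s|\mathbf{0}, \mathbf{I}_{L_s})\,\mathcal{N}(\mathbf{z}_i^x|\mathbf{0}, \mathbf{I}_{L_x})\,\mathcal{N}(\mathbf{z}_i^y|\mathbf{0}, \mathbf{I}_{L_y})$ (12.89)
$p(\mathbf{x}_i|\mathbf{z}_i) = \mathcal{N}(\mathbf{x}_i|\mathbf{B}_x\mathbf{z}_i^x + \mathbf{W}_x\mathbf{z}_i^s + \boldsymbol{\mu}_x, \sigma^2\mathbf{I}_{D_x})$ (12.90)
$p(\mathbf{y}_i|\mathbf{z}_i) = \mathcal{N}(\mathbf{y}_i|\mathbf{B}_y\mathbf{z}_i^y + \mathbf{W}_y\mathbf{z}_i^s + \boldsymbol{\mu}_y, \sigma^2\mathbf{I}_{D_y})$ (12.91)

See Figure 12.19(c). The corresponding observed joint distribution has the form

$p(\mathbf{v}_i|\boldsymbol{\theta}) = \int \mathcal{N}(\mathbf{v}_i|\mathbf{W}\mathbf{z}_i + \boldsymbol{\mu}, \sigma^2\mathbf{I})\,\mathcal{N}(\mathbf{z}_i|\mathbf{0}, \mathbf{I})\,d\mathbf{z}_i = \mathcal{N}(\mathbf{v}_i|\boldsymbol{\mu}, \mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I}_D)$ (12.92)

where

$\mathbf{W} = \begin{pmatrix} \mathbf{W}_x & \mathbf{B}_x & \mathbf{0} \\ \mathbf{W}_y & \mathbf{0} & \mathbf{B}_y \end{pmatrix}$ (12.93)
$\mathbf{W}\mathbf{W}^T = \begin{pmatrix} \mathbf{W}_x\mathbf{W}_x^T + \mathbf{B}_x\mathbf{B}_x^T & \mathbf{W}_x\mathbf{W}_y^T \\ \mathbf{W}_y\mathbf{W}_x^T & \mathbf{W}_y\mathbf{W}_y^T + \mathbf{B}_y\mathbf{B}_y^T \end{pmatrix}$ (12.94)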


One can compute the MLE for this model using EM. (Bach and Jordan 2005) show that the 
resulting MLE is equivalent (up to rotation and scaling) to the classical, non-probabilistic view. 
However, the advantages of the probabilistic view are many: we can trivially generalize to M > 2 
observed variables; we can create mixtures of CCA (Viinikanoja et al. 2010); we can create sparse 
versions of CCA using ARD (Archambeau and Bach 2008); we can generalize to the exponential 
family (Klami et al. 2010); we can perform Bayesian inference of the parameters (Wang 2007; 
Klami and Kaski 2008); we can handle non-parametric sparsity-promoting priors for W and B 
(Rai and Daume 2009); and so on. 


Independent Component Analysis (ICA) 


Consider the following situation. You are in a crowded room and many people are speaking. 
Your ears essentially act as two microphones, which are listening to a linear combination of the 
different speech signals in the room. Your goal is to deconvolve the mixed signals into their 
constituent parts. This is known as the cocktail party problem, and is an example of blind 
signal separation (BSS), or blind source separation, where “blind” means we know “nothing” 
about the source of the signals. Besides the obvious applications to acoustic signal processing, 
this problem also arises when analysing EEG and MEG signals, financial data, and any other 
dataset (not necessarily temporal) where latent sources or factors get mixed together in a linear 
way. 

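We can formalize the problem as follows. Let $\mathbf{x}_t \in \mathbb{R}^D$ be the observed signal at the sensors
at “time” t, and $\mathbf{z}_t \in \mathbb{R}^L$ be the vector of source signals. We assume that

$\mathbf{x}_t = \mathbf{W}\mathbf{z}_t + \boldsymbol{\epsilon}_t$ (12.95)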



Figure 12.20 Illustration of ICA applied to 500 iid samples of a 4d source signal. (a) Latent signals. (b) 
Observations. (c) PCA estimate. (d) ICA estimate. Figure generated by icaDemo, written by Aapo Hyvarinen. 


where $\mathbf{W}$ is a D x L matrix, and $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Psi})$. In this section, we treat each time point
as an independent observation, i.e., we do not model temporal correlation (so we could replace
the t index with i, but we stick with t to be consistent with much of the ICA literature). The
goal is to infer the source signals, $p(\mathbf{z}_t|\mathbf{x}_t, \boldsymbol{\theta})$, as illustrated in Figure 12.20. In this context, $\mathbf{W}$
is called the mixing matrix. If L = D (number of sources = number of sensors), it will be a
square matrix. Often we will assume the noise level, $|\boldsymbol{\Psi}|$, is zero, for simplicity.

So far, the model is identical to factor analysis (or PCA if there is no noise, except we don't in
general require orthogonality of $\mathbf{W}$). However, we will use a different prior for $p(\mathbf{z}_t)$. In PCA,
we assume each source is independent, and has a Gaussian distribution

$p(\mathbf{z}_t) = \prod_{j=1}^{L}\mathcal{N}(z_{tj}|0, 1)$ (12.96)


We will now relax this Gaussian assumption and let the source distributions be any non-Gaussian
distribution

$p(\mathbf{z}_t) = \prod_{j=1}^{L} p_j(z_{tj})$ (12.97)

Without loss of generality, we can constrain the variance of the source distributions to be 1,
because any other variance can be modelled by scaling the columns of $\mathbf{W}$ appropriately. The
resulting model is known as independent component analysis or ICA.

Figure 12.21 Illustration of ICA and PCA applied to 100 iid samples of a 2d source signal with a uniform
distribution. (a) Latent signals. (b) Observations. (c) PCA estimate. (d) ICA estimate. Figure generated by
icaDemoUniform, written by Aapo Hyvarinen.

The reason the Gaussian distribution is disallowed as a source prior in ICA is that it does not 
permit unique recovery of the sources, as illustrated in Figure 12.20(c). This is because the PCA 
likelihood is invariant to any orthogonal transformation of the sources $\mathbf{z}_t$ and mixing matrix $\mathbf{W}$.
PCA can recover the best linear subspace in which the signals lie, but cannot uniquely recover 
the signals themselves. 



To illustrate this, suppose we have two independent sources with uniform distributions, as 
shown in Figure 12.21(a). Now suppose we have the following mixing matrix 


$\mathbf{W} = \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}$ (12.98)


Then we observe the data shown in Figure 12.21(b) (assuming no noise). If we apply PCA followed 
by scaling to this, we get the result in Figure 12.21(c). This corresponds to a whitening of the 
data. To uniquely recover the sources, we need to perform an additional rotation. The trouble 
is, there is no information in the symmetric Gaussian posterior to tell us which angle to rotate 
by. In a sense, PCA solves “half” of the problem, since it identifies the linear subspace; all 
that ICA has to do is then to identify the appropriate rotation. (Hence we see that ICA is not 
that different from methods such as varimax, which seek good rotations of the latent factors to 
enhance interpretability.) 
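
The claim that PCA followed by scaling leaves only a rotation to be found can be illustrated numerically. The sketch below is an assumption-laden toy, not icaDemoUniform: it draws two unit-variance uniform sources, mixes them with an illustrative 2 x 2 matrix, whitens the result, and then inspects the matrix `A` that maps the sources to the whitened data. If whitening does its job, `A` should be approximately orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20_000
# Two independent uniform sources with zero mean and unit variance.
Z = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(T, 2))
W_mix = np.array([[2.0, 3.0], [2.0, 1.0]])     # illustrative mixing matrix (an assumption)
X = Z @ W_mix.T                                # noise-free observations x_t = W z_t

# PCA followed by scaling: rotate onto the eigenvectors, divide by sqrt(eigenvalue).
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(Xc.T @ Xc / T)
whiten = (evecs / np.sqrt(evals)).T            # Lambda^{-1/2} E^T
X_white = Xc @ whiten.T

cov_white = X_white.T @ X_white / T            # should be the identity
A = whiten @ W_mix                             # residual map from sources to whitened data
```

Since `A @ A.T` is close to the identity (up to sampling noise), the whitened data differ from the sources only by a rotation or reflection, which is the part that ICA still has to estimate.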

Figure 12.21(d) shows that ICA can recover the source, up to a permutation of the indices and 
possible sign change. ICA requires that W is square and hence invertible. In the non-square 
case (e.g., where we have more sources than sensors), we cannot uniquely recover the true signal, 
but we can compute the posterior $p(\mathbf{z}_t|\mathbf{x}_t, \hat{\mathbf{W}})$, which represents our beliefs about the source.
In both cases, we need to estimate $\mathbf{W}$ as well as the source distributions $p_j$. We discuss how
to do this below. 


Maximum likelihood estimation 


In this section, we discuss ways to estimate square mixing matrices W for the noise-free ICA 
model. As usual, we will assume that the observations have been centered; hence we can also 
assume z is zero-mean. In addition, we assume the observations have been whitened, which 
can be done with PCA. 


If the data is centered and whitened, we have $\mathbb{E}[xx^T] = I$. But in the noise free case, we 
also have 

$$\text{cov}[x] = \mathbb{E}[xx^T] = W \mathbb{E}[zz^T] W^T = WW^T \tag{12.99}$$


Hence we see that W must be orthogonal. This reduces the number of parameters we have to 
estimate from $D^2$ to $D(D-1)/2$. It will also simplify the math and the algorithms. 

Let $V = W^{-1}$; these are often called the recognition weights, as opposed to W, which are 
the generative weights.⁴ 

Since x = Wz, we have, from Equation 2.89, 


$$p_x(W z_t) = p_z(z_t)|\det(W^{-1})| = p_z(V x_t)|\det(V)| \tag{12.100}$$


Hence we can write the log-likelihood, assuming T iid samples, as follows: 


$$\frac{1}{T} \log p(\mathcal{D}|V) = \log|\det(V)| + \frac{1}{T} \sum_{j=1}^{L} \sum_{t=1}^{T} \log p_j(v_j^T x_t) \tag{12.101}$$


4. In the literature, it is common to denote the generative weights by A and the recognition weights by W, but we are 
trying to be consistent with the notation used earlier in this chapter. 
where $v_j$ is the j'th row of V. Since we are constraining V to be orthogonal, the first term is a 
constant, so we can drop it. We can also replace the average over the data with an expectation 
operator to get the following objective 

$$\text{NLL}(V) = \sum_{j=1}^{L} \mathbb{E}[G_j(z_j)] \tag{12.102}$$

where $z_j = v_j^T x$ and $G_j(z) \triangleq -\log p_j(z)$. We want to minimize this subject to the constraint 
that the rows of V are orthogonal. We also want them to be unit norm, since this ensures 
that the variance of the factors is unity (since, with whitened data, $\mathbb{E}[(v_j^T x)^2] = \|v_j\|^2$), which is 
necessary to fix the scale of the weights. In other words, V should be an orthonormal matrix. 
It is straightforward to derive a gradient descent algorithm to fit this model; however, it 
is rather slow. One can also derive a faster algorithm that follows the natural gradient; see 
e.g., (MacKay 2003, ch 34) for details. A popular alternative is to use an approximate Newton 
method, which we discuss in Section 12.6.2. Another approach is to use EM, which we discuss 
in Section 12.6.3. 


The FastICA algorithm 


We now describe the fast ICA algorithm, based on (Hyvarinen and Oja 2000), which we will 
show is an approximate Newton method for fitting ICA models. 

For simplicity of presentation, we initially assume there is only one latent factor. In addition, 
we initially assume all source distributions are known and are the same, so we can just write 
$G(z) = -\log p(z)$. Let $g(z) = \frac{d}{dz} G(z)$. The constrained objective, and its gradient and 
Hessian, are given by 

$$f(v) = \mathbb{E}[G(v^T x)] + \lambda(1 - v^T v) \tag{12.103}$$
$$\nabla f(v) = \mathbb{E}[x g(v^T x)] - \beta v \tag{12.104}$$
$$H(v) = \mathbb{E}[x x^T g'(v^T x)] - \beta I \tag{12.105}$$

where $\beta = 2\lambda$ is a Lagrange multiplier. Let us make the approximation 

$$\mathbb{E}[x x^T g'(v^T x)] \approx \mathbb{E}[x x^T] \, \mathbb{E}[g'(v^T x)] = \mathbb{E}[g'(v^T x)] \tag{12.106}$$

This makes the Hessian very easy to invert, giving rise to the following Newton update: 

$$v^* \triangleq v - \frac{\mathbb{E}[x g(v^T x)] - \beta v}{\mathbb{E}[g'(v^T x)] - \beta} \tag{12.107}$$

One can rewrite this in the following way 

$$v^* \triangleq \mathbb{E}[x g(v^T x)] - \mathbb{E}[g'(v^T x)] v \tag{12.108}$$


(In practice, the expectations can be replaced by Monte Carlo estimates from the training set, 
which gives an efficient online learning algorithm.) After performing this update, one should 
project back onto the constraint surface using 


$$v^{new} \triangleq \frac{v^*}{\|v^*\|} \tag{12.109}$$
Figure 12.22 Illustration of Gaussian, sub-Gaussian (uniform) and super-Gaussian (Laplace) distributions 
in 1d and 2d. Figure generated by subSuperGaussPlot, written by Kevin Swersky. 


One iterates this algorithm until convergence. (Due to the sign ambiguity of v, the values of v 
may not converge, but the direction defined by this vector should converge, so one can assess 
convergence by monitoring $|v^T v^{new}|$, which should approach 1.) 

Since the objective is not convex, there are multiple local optima. We can use this fact to 
learn multiple different weight vectors or features. We can either learn the features sequentially 
and then project out the part of $v_j$ that lies in the subspace defined by earlier features, or 
we can learn them in parallel, and orthogonalize V in parallel. This latter approach is usually 
preferred, since, unlike PCA, the features are not ordered in any way. So the first feature is not 
“more important” than the second, and hence it is better to treat them symmetrically. 


Modeling the source densities 


So far, we have assumed that G(z) = — log p(z) is known. What kinds of models might be 
reasonable as signal priors? We know that using Gaussians (which correspond to quadratic 
functions for G) won't work. So we want some kind of non-Gaussian distribution. In general, 
there are several kinds of non-Gaussian distributions, such as the following: 


• Super-Gaussian distributions These are distributions which have a big spike at the mean, 
and hence (in order to ensure unit variance) have heavy tails. The Laplace distribution is 
a classic example. See Figure 12.22. Formally, we say a distribution is super-Gaussian or 
leptokurtic (“lepto” coming from the Greek for “thin”) if kurt(z) > 0, where kurt(z) is the 
kurtosis of the distribution, defined by 

$$\text{kurt}(z) \triangleq \frac{\mu_4}{\sigma^4} - 3 \tag{12.110}$$

where $\sigma$ is the standard deviation, and $\mu_k$ is the k'th central moment, or moment about 
the mean: 

$$\mu_k \triangleq \mathbb{E}[(X - \mathbb{E}[X])^k] \tag{12.111}$$

(So $\mu_2 = \sigma^2$ is the variance; the first central moment $\mu_1$ is zero by construction.) It is conventional to subtract 3 in the 
definition of kurtosis to make the kurtosis of a Gaussian variable equal to zero. 

• Sub-Gaussian distributions A sub-Gaussian or platykurtic (“platy” coming from the Greek 
for “broad”) distribution has negative kurtosis. These are distributions which are much flatter 
than a Gaussian. The uniform distribution is a classic example. See Figure 12.22. 

• Skewed distributions Another way to “be non-Gaussian” is to be asymmetric. One measure 
of this is skewness, defined by 

$$\text{skew}(z) \triangleq \frac{\mu_3}{\sigma^3} \tag{12.112}$$

An example of a (right) skewed distribution is the gamma distribution (see Figure 2.9). 
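
The kurtosis and skewness definitions above can be checked empirically. The following is a minimal sketch using only the Python standard library; the helper names (`central_moment`, `kurt`, `skew`), the sample size, and the particular distributions sampled are illustrative choices, not part of the text.

```python
import random

def central_moment(xs, k):
    """Empirical k'th central moment, mu_k = E[(X - E[X])^k]."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** k for x in xs) / len(xs)

def kurt(xs):
    """Excess kurtosis mu_4 / sigma^4 - 3, as in Equation 12.110."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0

def skew(xs):
    """Skewness mu_3 / sigma^3, as in Equation 12.112."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

random.seed(0)
n = 50_000  # sample size (arbitrary illustrative choice)
gauss = [random.gauss(0.0, 1.0) for _ in range(n)]
uniform = [random.uniform(-1.0, 1.0) for _ in range(n)]
# The difference of two independent unit-rate exponentials is Laplace distributed.
laplace = [random.expovariate(1.0) - random.expovariate(1.0) for _ in range(n)]
# Gamma with shape 2 and scale 1 is right-skewed.
gamma = [random.gammavariate(2.0, 1.0) for _ in range(n)]

for name, xs in [("Gaussian", gauss), ("uniform", uniform),
                 ("Laplace", laplace), ("gamma", gamma)]:
    print(f"{name:8s} kurt = {kurt(xs):+.2f}  skew = {skew(xs):+.2f}")
```

Up to sampling noise, the uniform sample should report negative kurtosis (sub-Gaussian), the Laplace sample positive kurtosis (super-Gaussian), and the gamma sample positive skewness.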


When one looks at the empirical distribution of many natural signals, such as images and 
speech, when passed through certain linear filters, they tend to be very super-Gaussian. This 
result holds both for the kind of linear filters found in certain parts of the brain, such as the 
simple cells found in the primary visual cortex, as well as for the kinds of linear filters used in 
signal processing, such as wavelet transforms. One obvious choice for modeling natural signals 
with ICA is therefore the Laplace distribution. For mean zero and variance 1, this has a log pdf 
given by 


$$\log p(z) = -\sqrt{2}|z| - \log(\sqrt{2}) \tag{12.113}$$


Since the Laplace prior is not differentiable at the origin, it is more common to use other, 
smoother super-Gaussian distributions. One example is the logistic distribution. The corresponding 
log pdf, for the case where the mean is zero and the variance is 1 (so $\mu = 0$ and 
$s = \sqrt{3}/\pi$), is given by the following: 

$$\log p(z) = -2 \log\cosh\left(\frac{\pi}{2\sqrt{3}} z\right) - \log\frac{4\sqrt{3}}{\pi} \tag{12.114}$$
Figure 12.23 Modeling the source distributions using a mixture of univariate Gaussians (the independent 
factor analysis model of (Moulines et al. 1997; Attias 1999)). 


Various ways of estimating $G(z) = -\log p(z)$ are discussed in the seminal paper (Pham and 
Garrat 1997). However, when fitting ICA by maximum likelihood, it is not critical that the exact 
shape of the source distribution be known (although it is important to know whether it is sub 
or super Gaussian). Consequently, it is common to just use $G(z) = \sqrt{z}$ or $G(z) = \log\cosh(z)$ 
instead of the more complex expressions above. 
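
To make the fixed-point iteration concrete, here is a minimal sketch of the one-unit update (Equations 12.108 and 12.109) with G(z) = log cosh(z), so that g(z) = tanh(z) and g'(z) = 1 - tanh(z)^2. It uses only the Python standard library and is restricted to two dimensions so that whitening can be done in closed form. The function names, the sample size, the starting vector, and the stopping tolerance are all illustrative assumptions; the data follow the two-uniform-source example with the mixing matrix of Equation 12.98, read here as rows (2, 3) and (2, 1).

```python
import math
import random

def whiten_2d(xs):
    """Center and whiten 2-d points via the closed-form eigendecomposition
    of their 2x2 empirical covariance (PCA followed by rescaling)."""
    n = len(xs)
    m0 = sum(p[0] for p in xs) / n
    m1 = sum(p[1] for p in xs) / n
    c = [(p[0] - m0, p[1] - m1) for p in xs]
    a = sum(p[0] * p[0] for p in c) / n
    b = sum(p[0] * p[1] for p in c) / n
    d = sum(p[1] * p[1] for p in c) / n
    theta = 0.5 * math.atan2(2 * b, a - d)        # angle of the leading eigenvector
    ct, st = math.cos(theta), math.sin(theta)
    half = math.hypot((a - d) / 2, b)
    lam1, lam2 = (a + d) / 2 + half, (a + d) / 2 - half
    return [((ct * p[0] + st * p[1]) / math.sqrt(lam1),
             (-st * p[0] + ct * p[1]) / math.sqrt(lam2)) for p in c]

def fastica_one_unit(xs, v=(1.0, 0.0), max_iter=200, tol=1e-10):
    """One-unit fixed-point iteration with g = tanh on whitened data xs."""
    n = len(xs)
    for _ in range(max_iter):
        e_xg0 = e_xg1 = e_gp = 0.0
        for p in xs:
            g = math.tanh(v[0] * p[0] + v[1] * p[1])
            e_xg0 += p[0] * g
            e_xg1 += p[1] * g
            e_gp += 1.0 - g * g
        # v* = E[x g(v'x)] - E[g'(v'x)] v, with sample averages as expectations
        s0 = e_xg0 / n - (e_gp / n) * v[0]
        s1 = e_xg1 / n - (e_gp / n) * v[1]
        norm = math.hypot(s0, s1)
        v_new = (s0 / norm, s1 / norm)            # project back to unit norm
        converged = abs(v[0] * v_new[0] + v[1] * v_new[1]) > 1.0 - tol
        v = v_new
        if converged:
            break
    return v

random.seed(0)
hw = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has unit variance
sources = [(random.uniform(-hw, hw), random.uniform(-hw, hw)) for _ in range(2000)]
mixed = [(2 * s1 + 3 * s2, 2 * s1 + 1 * s2) for s1, s2 in sources]
white = whiten_2d(mixed)
v = fastica_one_unit(white)
recovered = [v[0] * p[0] + v[1] * p[1] for p in white]
```

The sequence `recovered` should match one of the two sources up to sign and sampling error. Convergence is tested on the absolute value of the inner product of successive iterates because the sign of v may flip between iterations.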


Using EM 


An alternative to assuming a particular form for G(z), or equivalently for p(z), is to use a 
flexible non-parametric density estimator, such as a mixture of (uni-variate) Gaussians: 


$$p(q_j = k) = \pi_k \tag{12.115}$$
$$p(z_j|q_j = k) = \mathcal{N}(\mu_{j,k}, \sigma_{j,k}^2) \tag{12.116}$$
$$p(x|z) = \mathcal{N}(Wz, \Psi) \tag{12.117}$$


This approach was proposed in (Moulines et al. 1997; Attias 1999), and the corresponding graph- 
ical model is shown in Figure 12.23. 

It is possible to derive an exact EM algorithm for this model. The key observation is that 
it is possible to compute $\mathbb{E}[z_t|x_t, \theta]$ exactly by summing over all $K^L$ combinations of the $q_t$ 
variables, where K is the number of mixture components per source. (If this is too expensive, 
one can use a variational mean field approximation (Attias 1999).) We can then estimate all the 
source distributions in parallel by fitting a standard GMM to $\mathbb{E}[z_t]$. When the source GMMs are 
known, we can compute the marginals $p_j(z_j)$ very easily, using 

$$p_j(z_j) = \sum_{k=1}^{K} \pi_{j,k} \mathcal{N}(z_j|\mu_{j,k}, \sigma_{j,k}^2) \tag{12.118}$$

Given the $p_j$'s, we can then use an ICA algorithm to estimate W. Of course, these steps should 
be interleaved. The details can be found in (Attias 1999). 


Other estimation principles * 


It is quite common to estimate the parameters of ICA models using methods that seem different 
to maximum likelihood. We will review some of these methods below, because they give 
additional insight into ICA. However, we will also see that these methods in fact are equivalent 
to maximum likelihood after all. Our presentation is based on (Hyvarinen and Oja 2000). 


Maximizing non-Gaussianity 


An early approach to ICA was to find a matrix V such that the distribution z = Vx is as far 
from Gaussian as possible. (There is a related approach in statistics called projection pursuit.) 
One measure of non-Gaussianity is kurtosis, but this can be sensitive to outliers. Another 
measure is the negentropy, defined as 


$$\text{negentropy}(z) \triangleq H(\mathcal{N}(\mu, \sigma^2)) - H(z) \tag{12.119}$$

where $\mu = \mathbb{E}[z]$ and $\sigma^2 = \text{var}[z]$. Since the Gaussian is the maximum entropy distribution, 
this measure is always non-negative and becomes large for distributions that are highly 
non-Gaussian. 
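
As a small worked check of the negentropy definition, the sketch below evaluates it in closed form for the two non-Gaussian examples used earlier in this section, the uniform and the Laplace distribution, each scaled to unit variance. The entropy formulas are the standard differential entropies of these families (in nats); the function names are illustrative.

```python
import math

def gaussian_entropy(var):
    """Differential entropy of N(mu, var): 0.5 * log(2 * pi * e * var)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def uniform_entropy(var):
    """A uniform on an interval of width w has variance w^2 / 12 and entropy log(w)."""
    return math.log(math.sqrt(12.0 * var))

def laplace_entropy(var):
    """A Laplace with scale b has variance 2 b^2 and entropy 1 + log(2 b)."""
    b = math.sqrt(var / 2.0)
    return 1.0 + math.log(2.0 * b)

negentropy_uniform = gaussian_entropy(1.0) - uniform_entropy(1.0)
negentropy_laplace = gaussian_entropy(1.0) - laplace_entropy(1.0)
print(negentropy_uniform, negentropy_laplace)
```

Both values are positive, as the maximum entropy property of the Gaussian requires, and both are fairly small, which is one reason sample-based estimates of negentropy need care.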

We can define our objective as maximizing 


$$J(V) = \sum_j \text{negentropy}(z_j) = \sum_j H(\mathcal{N}(\mu_j, \sigma_j^2)) - H(z_j) \tag{12.120}$$

where z = Vx. If we fix V to be orthogonal, and if we whiten the data, the covariance of z 
will be I independently of V, so the first term is a constant. Hence 

$$J(V) = \sum_j -H(z_j) + \text{const} = \sum_j \mathbb{E}[\log p(z_j)] + \text{const} \tag{12.121}$$


which we see is equal (up to a sign change, and irrelevant constants) to the log-likelihood in 
Equation 12.102. 


Minimizing mutual information 


One measure of dependence of a set of random variables is the multi-information: 


$$I(z) \triangleq \mathbb{KL}\Big(p(z) \,\Big\|\, \prod_j p(z_j)\Big) = \sum_j H(z_j) - H(z) \tag{12.122}$$
We would like to minimize this, since we are trying to find independent components. Put 
another way, we want the best possible factored approximation to the joint distribution. 
Now since z = Vx, we have 

$$I(z) = \sum_j H(z_j) - H(Vx) \tag{12.123}$$

If we constrain V to be orthogonal, we can drop the last term, since then H(Vx) = H(x) 
(since multiplying by V does not change the shape of the distribution), and H(x) is a constant 
which is solely determined by the empirical distribution. Hence, up to an additive constant, we have $I(z) = \sum_j H(z_j)$. 
Minimizing this is equivalent to maximizing the negentropy, which is equivalent to maximum 
likelihood. 


Maximizing mutual information (infomax) 


Instead of trying to minimize the mutual information between the components of z, let us 
imagine a neural network where x is the input and $y_j = \phi(v_j^T x) + \epsilon$ is the noisy output, where 
$\phi$ is some nonlinear scalar function, and $\epsilon \sim \mathcal{N}(0, 1)$. It seems reasonable to try to maximize 
the information flow through this system, a principle known as infomax (Bell and Sejnowski 
1995). That is, we want to maximize the mutual information between y (the internal neural 
representation) and x (the observed input signal). We have I(x; y) = H(y) - H(y|x), where 
the latter term is constant if we assume the noise has constant variance. One can show that we 
can approximate the former term as follows 

$$H(y) \approx \sum_{j=1}^{L} \mathbb{E}[\log \phi'(v_j^T x)] + \log|\det(V)| \tag{12.124}$$

where, as usual, we can drop the last term if V is orthogonal. If we define $\phi(z)$ to be a cdf, 
then $\phi'(z)$ is its pdf, and the above expression is equivalent to the log likelihood. In particular, 
if we use a logistic nonlinearity, $\phi(z) = \text{sigm}(z)$, then the corresponding pdf is the logistic 
distribution, and $\log \phi'(z) = -2\log\cosh(z/2)$ (ignoring irrelevant constants). Thus we see that 
infomax is equivalent to maximum likelihood. 


Exercises 


Exercise 12.1 M step for FA 
For the FA model, show that the MLE in the M step for W is given by Equation 12.23. 


Exercise 12.2 MAP estimation for the FA model 


Derive the M step for the FA model using conjugate priors for the parameters. 


Exercise 12.3 Heuristic for assessing applicability of PCA 

(Source: (Press 2005, Q9.8).) Let the empirical covariance matrix $\Sigma$ have eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d > 0$. 
Explain why the variance of the evalues, $\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (\lambda_i - \bar{\lambda})^2$, is a good measure of whether 
or not PCA would be useful for analysing the data (the higher the value of $\sigma^2$ the more useful PCA). 
Exercise 12.4 Deriving the second principal component 
a. Let 

$$J(v_2, z_2) = \frac{1}{n} \sum_{i=1}^{n} (x_i - z_{i1} v_1 - z_{i2} v_2)^T (x_i - z_{i1} v_1 - z_{i2} v_2) \tag{12.125}$$

Show that $\frac{\partial J}{\partial z_2} = 0$ yields $z_{i2} = v_2^T x_i$. 
b. Show that the value of $v_2$ that minimizes 

$$\tilde{J}(v_2) = -v_2^T C v_2 + \lambda_2 (v_2^T v_2 - 1) + \lambda_{12} (v_2^T v_1 - 0) \tag{12.126}$$

is given by the eigenvector of C with the second largest eigenvalue. Hint: recall that $C v_1 = \lambda_1 v_1$ and 
$\frac{\partial x^T A x}{\partial x} = (A + A^T) x$. 


Exercise 12.5 Deriving the residual error for PCA 


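a. Prove that 

$$\Big\|x_i - \sum_{j=1}^{K} z_{ij} v_j\Big\|^2 = x_i^T x_i - \sum_{j=1}^{K} v_j^T x_i x_i^T v_j \tag{12.127}$$

Hint: first consider the case K = 2. Use the fact that $v_j^T v_j = 1$ and $v_j^T v_k = 0$ for $k \neq j$. Also, 
recall $z_{ij} = x_i^T v_j$. 
b. Now show that 

$$J_K \triangleq \frac{1}{n} \sum_{i=1}^{n} \Big( x_i^T x_i - \sum_{j=1}^{K} v_j^T x_i x_i^T v_j \Big) = \frac{1}{n} \sum_{i=1}^{n} x_i^T x_i - \sum_{j=1}^{K} \lambda_j \tag{12.128}$$

Hint: recall $v_j^T C v_j = \lambda_j v_j^T v_j = \lambda_j$. 

c. If K = d there is no truncation, so $J_d = 0$. Use this to show that the error from only using K < d 
terms is given by 

$$J_K = \sum_{j=K+1}^{d} \lambda_j \tag{12.129}$$

Hint: partition the sum $\sum_{j=1}^{d} \lambda_j$ into $\sum_{j=1}^{K} \lambda_j$ and $\sum_{j=K+1}^{d} \lambda_j$. 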


Exercise 12.6 Derivation of Fisher’s linear discriminant 


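Show that the maximum of $J(w) = \frac{w^T S_B w}{w^T S_W w}$ is given by $S_B w = \lambda S_W w$ 
where $\lambda = \frac{w^T S_B w}{w^T S_W w}$. Hint: recall that the derivative of a ratio of two scalars is given by $\frac{d}{dx} \frac{f(x)}{g(x)} = \frac{f' g - f g'}{g^2}$, 
where $f' = \frac{d}{dx} f(x)$ and $g' = \frac{d}{dx} g(x)$. Also, recall that $\frac{d}{dx} x^T A x = (A + A^T) x$. 
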
Exercise 12.7 PCA via successive deflation 
Let $v_1, v_2, \ldots, v_k$ be the first k eigenvectors with largest eigenvalues of $C = \frac{1}{n} X^T X$, i.e., the principal 
basis vectors. These satisfy 

$$v_j^T v_k = \begin{cases} 0 & \text{if } j \neq k \\ 1 & \text{if } j = k \end{cases} \tag{12.130}$$

We will construct a method for finding the $v_j$ sequentially. 
As we showed in class, $v_1$ is the first principal eigenvector of C, and satisfies $C v_1 = \lambda_1 v_1$. Now define 
$\tilde{x}_i$ as the orthogonal projection of $x_i$ onto the space orthogonal to $v_1$: 

$$\tilde{x}_i = P_{\perp v_1} x_i = (I - v_1 v_1^T) x_i \tag{12.131}$$

Define $\tilde{X} = [\tilde{x}_1; \ldots; \tilde{x}_n]$ as the deflated matrix of rank d - 1, which is obtained by removing from the d 
dimensional data the component that lies in the direction of the first principal direction: 

$$\tilde{X} = (I - v_1 v_1^T)^T X = (I - v_1 v_1^T) X \tag{12.132}$$

a. Using the facts that $X^T X v_1 = n \lambda_1 v_1$ (and hence $v_1^T X^T X = n \lambda_1 v_1^T$) and $v_1^T v_1 = 1$, show that 
the covariance of the deflated matrix is given by 

$$\tilde{C} \triangleq \frac{1}{n} \tilde{X}^T \tilde{X} = \frac{1}{n} X^T X - \lambda_1 v_1 v_1^T \tag{12.133}$$

b. Let u be the principal eigenvector of $\tilde{C}$. Explain why $u = v_2$. (You may assume u is unit norm.) 


c. Suppose we have a simple method for finding the leading eigenvector and eigenvalue of a pd matrix, 
denoted by [A, u] = f(C). Write some pseudo code for finding the first K principal basis vectors of 
X that only uses the special f function and simple vector arithmetic, i.e., your code should not use 
SVD or the eig function. Hint: this should be a simple iterative routine that takes 2-3 lines to write. 
The input is C, K and the function f, the output should be v; and A; for j = 1 : K. Do not worry 
about being syntactically correct. 


Exercise 12.8 Latent semantic indexing 


(Source: de Freitas.). In this exercise, we study a technique called latent semantic indexing, which applies 
SVD to a document by term matrix, to create a low-dimensional embedding of the data that is designed to 
capture semantic similarity of words. 


The file lsiDocuments.pdf contains 9 documents on various topics. A list of all the 460 unique 
words/terms that occur in these documents is in lsiWords.txt. A document by term matrix is in 
lsiMatrix.txt. 

a. Let X be the transpose of lsiMatrix, so each column represents a document. Compute the SVD of X 
and make an approximation to it $\hat{X}$ using the first 2 singular values/vectors. Plot the low dimensional 
representation of the 9 documents in 2D. You should get something like Figure 12.24. 

b. Consider finding documents that are about alien abductions. If you look at lsiWords.txt, there 
are 3 versions of this word, term 23 (“abducted”), term 24 (“abduction”) and term 25 (“abductions”). 
Suppose we want to find documents containing the word “abducted”. Documents 2 and 3 contain it, 
but document 1 does not. However, document 1 is clearly related to this topic. Thus LSI should also 
find document 1. Create a test document q containing the one word “abducted”, and project it into 
the 2D subspace to make $\hat{q}$. Now compute the cosine similarity between $\hat{q}$ and the low dimensional 
representation of all the documents. What are the top 3 closest matches? 


Exercise 12.9 Imputation in a FA model 
Derive an expression for $p(x_h|x_v, \theta)$ for a FA model. 
Exercise 12.10 Efficiently evaluating the PPCA density 


Derive an expression for $p(x|\hat{W}, \hat{\sigma}^2)$ for the PPCA model based on plugging in the MLEs and using the 
matrix inversion lemma. 
Figure 12.24 Projection of 9 documents into 2 dimensions. Figure generated by lsiCode. 


Exercise 12.11 PPCA vs FA 


(Source: Exercise 14.15 of (Hastie et al. 2009), due to Hinton.) Generate 200 observations from the following 
model, where $z_{il} \sim \mathcal{N}(0, 1)$: $x_{i1} = z_{i1}$, $x_{i2} = z_{i1} + 0.001 z_{i2}$, $x_{i3} = 10 z_{i3}$. Fit a FA and PCA model 
with 1 latent factor. Hence show that the corresponding weight vector w aligns with the maximal variance 
direction (dimension 3) in the PCA case, but with the maximal correlation direction (dimensions 1+2) in the 
case of FA. 


13.1 


Sparse linear models 


Introduction 


We introduced the topic of feature selection in Section 3.5.4, where we discussed methods for 
finding input variables which had high mutual information with the output. The trouble with 
this approach is that it is based on a myopic strategy that only looks at one variable at a time. 
This can fail if there are interaction effects. For example, if $y = \text{xor}(x_1, x_2)$, then neither $x_1$ nor 
$x_2$ on its own can predict the response, but together they perfectly predict the response. For a 
real-world example of this, consider genetic association studies: sometimes two genes on their 
own may be harmless, but when present together they cause a recessive disease (Balding 2006). 

In this chapter, we focus on selecting sets of variables at a time using a model-based approach. 
If the model is a generalized linear model, of the form $p(y|x) = p(y|f(w^T x))$ for some link 
function f, then we can perform feature selection by encouraging the weight vector w to be 
sparse, i.e., to have lots of zeros. This approach turns out to offer significant computational 
advantages, as we will see below. 

Here are some applications where feature selection/ sparsity is useful: 


• In many problems, we have many more dimensions D than training cases N. The corresponding 
design matrix is short and fat, rather than tall and skinny. This is called the 
small N, large D problem. This is becoming increasingly prevalent as we develop more 
high throughput measurement devices. For example, with gene microarrays, it is common 
to measure the expression levels of D ∼ 10,000 genes, but to only get N ∼ 100 such 
examples. (It is perhaps a sign of the times that even our data seems to be getting fatter...) 
We may want to find the smallest set of features that can accurately predict the response 
(e.g., growth rate of the cell) in order to prevent overfitting, to reduce the cost of building a 
diagnostic device, or to help with scientific insight into the problem. 

• In Chapter 14, we will use basis functions centered on the training examples, so 
$\phi(x) = [\kappa(x, x_1), \ldots, \kappa(x, x_N)]$, where $\kappa$ is a kernel function. The resulting design matrix has size 
N × N. Feature selection in this context is equivalent to selecting a subset of the training 
examples, which can help reduce overfitting and computational cost. This is known as a 
sparse kernel machine. 

• In signal processing, it is common to represent signals (images, speech, etc.) in terms of 
wavelet basis functions. To save time and space, it is useful to find a sparse representation 
of the signals, in terms of a small number of such basis functions. This allows us to estimate 
signals from a small number of measurements, as well as to compress the signal. See 
Section 13.8.3 for more information. 


Note that the topic of feature selection and sparsity is currently one of the most active areas 
of machine learning/ statistics. In this chapter, we only have space to give an overview of the 
main results. 


Bayesian variable selection 


A natural way to pose the variable selection problem is as follows. Let $\gamma_j = 1$ if feature j is 
“relevant”, and let $\gamma_j = 0$ otherwise. Our goal is to compute the posterior over models 

$$p(\gamma|\mathcal{D}) = \frac{e^{-f(\gamma)}}{\sum_{\gamma'} e^{-f(\gamma')}} \tag{13.1}$$

where $f(\gamma)$ is the cost function: 

$$f(\gamma) \triangleq -[\log p(\mathcal{D}|\gamma) + \log p(\gamma)] \tag{13.2}$$

For example, suppose we generate N = 20 samples from a D = 10 dimensional linear 
regression model, $y_i \sim \mathcal{N}(w^T x_i, \sigma^2)$, in which K = 5 elements of w are non-zero. In 
particular, we use w = (0.00, -1.67, 0.13, 0.00, 0.00, 1.19, 0.00, -0.04, 0.33, 0.00) and $\sigma^2 = 1$. 
We enumerate all $2^{10} = 1024$ models and compute $p(\gamma|\mathcal{D})$ for each one (we give the 
equations for this below). We order the models in Gray code order, which ensures consecutive 
vectors differ by exactly 1 bit (the reasons for this are computational, and are discussed in 
Section 13.2.3). 
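
The Gray code ordering mentioned above can be generated with the standard reflected binary Gray code, in which the i'th code word is i XOR (i shifted right by one bit). The following sketch is only an illustration; the function names and the convention that bit j of the code word corresponds to feature j + 1 are assumptions, not taken from the text.

```python
def gray_code(i):
    """The i'th word of the reflected binary Gray code."""
    return i ^ (i >> 1)

def bit_vector(code, D):
    """Unpack an integer code word into a list of D bits (bit j <-> feature j + 1)."""
    return [(code >> j) & 1 for j in range(D)]

D = 10
models = [bit_vector(gray_code(i), D) for i in range(2 ** D)]

# Each model differs from its predecessor in exactly one bit,
# i.e. one variable is added or removed at a time.
flips = [sum(a != b for a, b in zip(models[i], models[i + 1]))
         for i in range(len(models) - 1)]
print(len(models), set(flips))
```

Because consecutive models differ by a single variable, a score computed for one model can in principle be updated incrementally for the next rather than recomputed from scratch.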

The resulting set of bit patterns is shown in Figure 13.1(a). The cost of each model, $f(\gamma)$, is 
shown in Figure 13.1(b). We see that this objective function is extremely “bumpy”. The results 
are easier to interpret if we compute the posterior distribution over models, $p(\gamma|\mathcal{D})$. This is 
shown in Figure 13.1(c). The top 8 models are listed below: 

model   prob    members 
4       0.447   2 
61      0.241   2, 6 
452     0.103   2, 6, 9 
60      0.091   2, 3, 6 
29      0.041   2, 5 
68      0.021   2, 6, 7 
36      0.015   2, 5, 6 
5       0.010   2, 3 

The “true” model is {2,3,6,8,9}. However, the coefficients associated with features 3 and 8 
are very small (relative to $\sigma^2$), so these variables are harder to detect. Given enough data, the 
method will converge on the true model (assuming the data is generated from a linear model), 
but for finite data sets, there will usually be considerable posterior uncertainty. 

Interpreting the posterior over a large number of models is quite difficult, so we will seek 
various summary statistics. A natural one is the posterior mode, or MAP estimate 


$$\hat{\gamma} = \arg\max p(\gamma|\mathcal{D}) = \arg\min f(\gamma) \tag{13.3}$$
Figure 13.1 (a) All possible bit vectors of length 10 enumerated in Gray code order. (b) Score function for 
all possible models. (c) Posterior over all 1024 models. Vertical scale has been truncated at 0.1 for clarity. 
(d) Marginal inclusion probabilities. Figure generated by linregAllsubsetsGraycodeDemo. 


However, the mode is often not representative of the full posterior mass (see Section 5.2.1.3). A 
better summary is the median model (Barbieri and Berger 2004; Carvahlo and Lawrence 2007), 
computed using 


$$\hat{\gamma} = \{j : p(\gamma_j = 1|\mathcal{D}) > 0.5\} \tag{13.4}$$


This requires computing the posterior marginal inclusion probabilities, $p(\gamma_j = 1|\mathcal{D})$. These 
are shown in Figure 13.1(d). We see that the model is confident that variables 2 and 6 are 
included; if we lower the decision threshold to 0.1, we would add 3 and 9 as well. However, if 
we wanted to “capture” variable 8, we would incur two false positives (5 and 7). This tradeoff 
between false positives and false negatives is discussed in more detail in Section 5.7.2.1. 
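
For small D, the posterior over models, the MAP model, and the median model can all be obtained by brute-force enumeration. The sketch below illustrates the computation for a generic cost function $f(\gamma)$. The cost used here (`toy_cost`, with made-up per-feature scores) is purely hypothetical and is not the linear regression score used to produce Figure 13.1; the function names are likewise illustrative.

```python
import math
from itertools import product

def posterior_over_models(cost, D):
    """p(gamma | D) proportional to exp(-f(gamma)), normalized over all 2^D bit vectors."""
    models = list(product([0, 1], repeat=D))
    costs = [cost(g) for g in models]
    c_min = min(costs)                      # subtract the minimum for numerical stability
    weights = [math.exp(-(c - c_min)) for c in costs]
    total = sum(weights)
    return {g: w / total for g, w in zip(models, weights)}

def marginal_inclusion(post, D):
    """p(gamma_j = 1 | D), obtained by summing the posterior over models containing j."""
    return [sum(p for g, p in post.items() if g[j] == 1) for j in range(D)]

def toy_cost(gamma):
    """Hypothetical cost: each included feature changes the score by a fixed amount."""
    gain = (3.0, 0.4, -1.0)                 # made-up numbers, for illustration only
    return -sum(e * g for e, g in zip(gain, gamma))

D = 3
post = posterior_over_models(toy_cost, D)
map_model = max(post, key=post.get)                              # posterior mode
incl = marginal_inclusion(post, D)
median_model = [j + 1 for j in range(D) if incl[j] > 0.5]        # threshold at 0.5
print(map_model, [round(p, 3) for p in incl], median_model)
```

In this toy example the cost is additive across features, so the MAP and median models coincide; with interacting features they can differ, which is the situation the median model is designed to handle.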

The above example illustrates the “gold standard” for variable selection: the problem was 
sufficiently small (only 10 variables) that we were able to compute the full posterior exactly. 
Of course, variable selection is most useful in the cases where the number of dimensions is 
large. Since there are $2^D$ possible models (bit vectors), it will be impossible to compute the 
full posterior in general, and even finding summaries, such as the MAP estimate or marginal 
inclusion probabilities, will be intractable. We will therefore spend most of this chapter focussing 
on algorithmic speedups. But before we do that, we will explain how we computed $p(\gamma|\mathcal{D})$ in 
the above example. 


The spike and slab model 
The posterior is given by 


$$p(\gamma|\mathcal{D}) \propto p(\gamma) p(\mathcal{D}|\gamma) \tag{13.5}$$


We first consider the prior, then the likelihood. 
It is common to use the following prior on the bit vector: 


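$$p(\gamma) = \prod_{j=1}^{D} \text{Ber}(\gamma_j|\pi_0) = \pi_0^{\|\gamma\|_0} (1 - \pi_0)^{D - \|\gamma\|_0} \tag{13.6}$$

where $\pi_0$ is the probability a feature is relevant, and $\|\gamma\|_0 = \sum_{j=1}^{D} \gamma_j$ is the $\ell_0$ pseudo-norm, 
that is, the number of non-zero elements of the vector. For comparison with later models, it is 
useful to write the log prior as follows: 

$$\log p(\gamma|\pi_0) = \|\gamma\|_0 \log \pi_0 + (D - \|\gamma\|_0) \log(1 - \pi_0) \tag{13.7}$$
$$= \|\gamma\|_0 (\log \pi_0 - \log(1 - \pi_0)) + \text{const} \tag{13.8}$$
$$= -\lambda \|\gamma\|_0 + \text{const} \tag{13.9}$$

where $\lambda \triangleq \log \frac{1 - \pi_0}{\pi_0}$ controls the sparsity of the model. 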
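
As a quick numerical check of the step from Equation 13.7 to Equation 13.9, the sketch below evaluates the Bernoulli log prior for two bit vectors and confirms that their difference depends only on the difference in the number of included features, scaled by $\lambda$. Here $\lambda$ is taken to be $\log((1 - \pi_0)/\pi_0)$, the value that makes Equations 13.8 and 13.9 agree; the function names and the example vectors are illustrative.

```python
import math

def log_prior(gamma, pi0):
    """log p(gamma | pi0) for independent Bernoulli(pi0) inclusion indicators."""
    D, k = len(gamma), sum(gamma)
    return k * math.log(pi0) + (D - k) * math.log(1.0 - pi0)

def sparsity_penalty(pi0):
    """lambda = log((1 - pi0) / pi0), so that log p(gamma) = -lambda * ||gamma||_0 + const."""
    return math.log((1.0 - pi0) / pi0)

pi0 = 0.1
lam = sparsity_penalty(pi0)
gamma_small = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0]   # 2 features included
gamma_large = [0, 1, 1, 0, 0, 1, 0, 1, 1, 0]   # 5 features included

# The constant cancels in the difference, leaving lambda times the change in ||gamma||_0.
diff = log_prior(gamma_small, pi0) - log_prior(gamma_large, pi0)
print(diff, lam * (sum(gamma_large) - sum(gamma_small)))
```

When $\pi_0 < 0.5$ the penalty $\lambda$ is positive, so each additional feature lowers the log prior; smaller $\pi_0$ gives a stronger preference for sparse models.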
We can write the likelihood as follows: 


$$p(\mathcal{D}|\gamma) = p(y|X, \gamma) = \int\!\!\int p(y|X, w, \gamma) p(w|\gamma, \sigma^2) p(\sigma^2) \, dw \, d\sigma^2 \tag{13.10}$$

For notational simplicity, we have assumed the response is centered (i.e., $\bar{y} = 0$), so we can 
ignore any offset term $\mu$. 

We now discuss the prior $p(w|\gamma, \sigma^2)$. If $\gamma_j = 0$, feature j is irrelevant, so we expect $w_j = 0$. 
If $\gamma_j = 1$, we expect $w_j$ to be non-zero. If we standardize the inputs, a reasonable prior is 
$\mathcal{N}(0, \sigma^2 \sigma_w^2)$, where $\sigma_w^2$ controls how big we expect the coefficients associated with the relevant 
variables to be (which is scaled by the overall noise level $\sigma^2$). We can summarize this prior as 
follows: 

$$p(w_j|\sigma^2, \gamma_j) = \begin{cases} \delta_0(w_j) & \text{if } \gamma_j = 0 \\ \mathcal{N}(w_j|0, \sigma^2 \sigma_w^2) & \text{if } \gamma_j = 1 \end{cases} \tag{13.11}$$

The first term is a “spike” at the origin. As $\sigma_w^2 \to \infty$, the distribution $p(w_j|\gamma_j = 1)$ approaches 
a uniform distribution, which can be thought of as a “slab” of constant height. Hence this is 
called the spike and slab model (Mitchell and Beauchamp 1988). 

We can drop the coefficients $w_j$ for which $\gamma_j = 0$ from the model, since they are clamped 
to zero under the prior. Hence Equation 13.10 becomes the following (assuming a Gaussian 
likelihood): 

$$p(\mathcal{D}|\gamma) = \int\!\!\int \mathcal{N}(y|X_\gamma w_\gamma, \sigma^2 I_N) \, \mathcal{N}(w_\gamma|0_{D_\gamma}, \sigma^2 \sigma_w^2 I_{D_\gamma}) \, p(\sigma^2) \, dw_\gamma \, d\sigma^2 \tag{13.12}$$

where $D_\gamma = \|\gamma\|_0$ is the number of non-zero elements in $\gamma$. In what follows, we will generalize 
this slightly by defining a prior of the form $p(w|\gamma, \sigma^2) = \mathcal{N}(w_\gamma|0_{D_\gamma}, \sigma^2 \Sigma_\gamma)$ for any positive 
definite matrix $\Sigma_\gamma$.¹ 

Given these priors, we can now compute the marginal likelihood. If the noise variance is 
known, we can write down the marginal likelihood (using Equation 13.151) as follows: 


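$$p(\mathcal{D}|\gamma, \sigma^2) = \int \mathcal{N}(y|X_\gamma w_\gamma, \sigma^2 I) \, \mathcal{N}(w_\gamma|0, \sigma^2 \Sigma_\gamma) \, dw_\gamma = \mathcal{N}(y|0, C_\gamma) \tag{13.13}$$
$$C_\gamma \triangleq \sigma^2 X_\gamma \Sigma_\gamma X_\gamma^T + \sigma^2 I_N \tag{13.14}$$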

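If the noise is unknown, we can put a prior on it and integrate it out. It is common to use 
$p(\sigma^2) = \text{IG}(\sigma^2|a_\sigma, b_\sigma)$. Some guidelines on setting a, b can be found in (Kohn et al. 2001). If 
we use a = b = 0, we recover the Jeffrey’s prior, $p(\sigma^2) \propto \sigma^{-2}$. When we integrate out the noise, 
we get the following more complicated expression for the marginal likelihood (Brown et al. 1998): 

$$p(\mathcal{D}|\gamma) = \int\!\!\int p(y|\gamma, w_\gamma, \sigma^2) \, p(w_\gamma|\gamma, \sigma^2) \, p(\sigma^2) \, dw_\gamma \, d\sigma^2 \tag{13.15}$$
$$\propto |X_\gamma^T X_\gamma + \Sigma_\gamma^{-1}|^{-\frac{1}{2}} \, |\Sigma_\gamma|^{-\frac{1}{2}} \, (2 b_\sigma + S(\gamma))^{-(2 a_\sigma + N - 1)/2} \tag{13.16}$$

where $S(\gamma)$ is the RSS: 

$$S(\gamma) \triangleq y^T y - y^T X_\gamma (X_\gamma^T X_\gamma + \Sigma_\gamma^{-1})^{-1} X_\gamma^T y \tag{13.17}$$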

See also Exercise 13.4. 
When the marginal likelihood cannot be computed in closed form (e.g., if we are using logistic 
regression or a nonlinear model), we can approximate it using BIC, which has the form 


$$\log p(\mathcal{D}|\gamma) \approx \log p(y|X, \hat{w}_\gamma, \hat{\sigma}^2) - \frac{\|\gamma\|_0}{2} \log N \tag{13.18}$$

where $\hat{w}_\gamma$ is the ML or MAP estimate based on $X_\gamma$, and $\|\gamma\|_0$ is the “degrees of freedom” of 
the model (Zou et al. 2007). Adding the log prior, the overall objective becomes 

$$\log p(\gamma|\mathcal{D}) \approx \log p(y|X, \hat{w}_\gamma, \hat{\sigma}^2) - \frac{\|\gamma\|_0}{2} \log N - \lambda \|\gamma\|_0 + \text{const} \tag{13.19}$$

We see that there are two complexity penalties: one arising from the BIC approximation to 
the marginal likelihood, and the other arising from the prior on $p(\gamma)$. Obviously these can be 
combined into one overall complexity parameter, which we will denote by $\lambda$. 


From the Bernoulli-Gaussian model to $\ell_0$ regularization 


Another model that is sometimes used (e.g., (Kuo and Mallick 1998; Zhou et al. 2009; Soussen 
et al. 2010)) is the following: 


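$$y_i|x_i, w, \gamma, \sigma^2 \sim \mathcal{N}\Big(\sum_j \gamma_j w_j x_{ij}, \sigma^2\Big) \tag{13.20}$$
$$\gamma_j \sim \text{Ber}(\pi_0) \tag{13.21}$$
$$w_j \sim \mathcal{N}(0, \sigma_w^2) \tag{13.22}$$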


1. It is common to use a g-prior of the form $\Sigma_\gamma = g (X_\gamma^T X_\gamma)^{-1}$ for reasons explained in Section 7.6.3.1 (see also 
Exercise 13.4). Various approaches have been proposed for setting g, including cross validation, empirical Bayes (Minka 
2000b; George and Foster 2000), hierarchical Bayes (Liang et al. 2008), etc. 
In the signal processing literature (e.g., (Soussen et al. 2010)), this is called the Bernoulli-Gaussian 
model, although we could also call it the binary mask model, since we can think of 
the $\gamma_j$ variables as “masking out” the weights $w_j$. 

Unlike the spike and slab model, we do not integrate out the “irrelevant” coefficients; they 
always exist. In addition, the binary mask model has the form $\gamma_j \to y \leftarrow w_j$, whereas the spike 
and slab model has the form $\gamma_j \to w_j \to y$. In the binary mask model, only the product $\gamma_j w_j$ 
can be identified from the likelihood. 

One interesting aspect of this model is that it can be used to derive an objective function that 
is widely used in the (non-Bayesian) subset selection literature. First, note that the joint prior 
has the form 


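$$p(\gamma, w) \propto \mathcal{N}(w|0, \sigma_w^2 I) \, \pi_0^{\|\gamma\|_0} (1 - \pi_0)^{D - \|\gamma\|_0} \tag{13.23}$$

Hence the scaled unnormalized negative log posterior has the form 

$$f(\gamma, w) \triangleq -2\sigma^2 \log p(\gamma, w, y|X) = \|y - X(\gamma .* w)\|^2 + \frac{\sigma^2}{\sigma_w^2} \|w\|^2 + \lambda \|\gamma\|_0 + \text{const} \tag{13.24}$$

where 

$$\lambda \triangleq 2\sigma^2 \log\Big(\frac{1 - \pi_0}{\pi_0}\Big) \tag{13.25}$$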


Let us split w into two subvectors, $w_{-\gamma}$ and $w_\gamma$, indexed by the zero and non-zero entries of 
$\gamma$ respectively. Since $X(\gamma .* w) = X_\gamma w_\gamma$, we can just set $w_{-\gamma} = 0$. 

Now consider the case where $\sigma_w^2 \to \infty$, so we do not regularize the non-zero weights (so 
there is no complexity penalty coming from the marginal likelihood or its BIC approximation). 
In this case, the objective becomes 

$$f(\gamma, w) = \|y - X_\gamma w_\gamma\|_2^2 + \lambda \|\gamma\|_0 \tag{13.26}$$


This is similar to the BIC objective above. 

Instead of keeping track of the bit vector y, we can define the set of relevant variables to 
be the support, or set of non-zero entries, of w. Then we can rewrite the above equation as 
follows: 


$$f(w) = \|y - Xw\|_2^2 + \lambda \|w\|_0 \tag{13.27}$$


This is called $\ell_0$ regularization. We have converted the discrete optimization problem (over 
$\gamma \in \{0, 1\}^D$) into a continuous one (over $w \in \mathbb{R}^D$); however, the $\ell_0$ pseudo-norm makes the 
objective very non smooth, so this is still hard to optimize. We will discuss different solutions 
to this in the rest of this chapter. 
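
To illustrate how the $\ell_0$-regularized objective trades data fit against sparsity, the sketch below evaluates the residual sum of squares plus $\lambda$ times the number of non-zero weights for two candidate weight vectors. The function name `l0_objective` and the tiny data set are made up for illustration only.

```python
def l0_objective(X, y, w, lam):
    """Residual sum of squares plus lam times the number of non-zero weights."""
    rss = 0.0
    for row, target in zip(X, y):
        pred = sum(x * wj for x, wj in zip(row, w))
        rss += (target - pred) ** 2
    return rss + lam * sum(1 for wj in w if wj != 0.0)

# Made-up data: 3 cases, 3 features.
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0],
     [1.0, 1.0, 0.0]]
y = [1.0, 2.0, 3.0]

w_dense = [1.0, 2.0, 0.0]    # fits y exactly using two features
w_sparse = [0.0, 2.0, 0.0]   # one feature, residuals (1, 0, 1)

for lam in (1.0, 3.0):
    print(lam, l0_objective(X, y, w_dense, lam), l0_objective(X, y, w_sparse, lam))
```

With the smaller penalty the two-feature fit has the lower objective, while with the larger penalty the one-feature fit wins. The objective jumps whenever a weight crosses zero, which is the non-smoothness that makes direct optimization hard.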


Algorithms 


Since there are $2^D$ models, we cannot explore the full posterior, or find the globally optimal
model. Instead we will have to resort to heuristics of one form or another. All of the methods
we will discuss involve searching through the space of models, and evaluating the cost $f(\boldsymbol{\gamma})$ at
each point. This requires fitting the model (i.e., computing $\arg\max_{\mathbf{w}} p(\mathcal{D}|\mathbf{w})$), or evaluating its
marginal likelihood (i.e., computing $\int p(\mathcal{D}|\mathbf{w}) p(\mathbf{w}) d\mathbf{w}$) at each step. This is sometimes called
the wrapper method, since we “wrap” our search for the best model (or set of good models)
around a generic model-fitting procedure.

Figure 13.2 (a) A lattice of subsets of {1, 2, 3, 4}. (b) Residual sum of squares versus subset size, on the
prostate cancer data set. The lower envelope is the best RSS achievable for any set of a given size. Based
on Figure 3.5 of (Hastie et al. 2001). Figure generated by prostateSubsets.

In order to make wrapper methods efficient, it is important that we can quickly evaluate the
score function for some new model, $\boldsymbol{\gamma}'$, given the score of a previous model, $\boldsymbol{\gamma}$. This can be
done provided we can efficiently update the sufficient statistics needed to compute $f(\boldsymbol{\gamma})$. This
is possible provided $\boldsymbol{\gamma}'$ only differs from $\boldsymbol{\gamma}$ in one bit (corresponding to adding or removing
a single variable), and provided $f(\boldsymbol{\gamma})$ only depends on the data via $\mathbf{X}_{\gamma}$. In this case, we can
use rank-one matrix updates/downdates to efficiently compute $\mathbf{X}_{\gamma'}^T \mathbf{X}_{\gamma'}$ from $\mathbf{X}_{\gamma}^T \mathbf{X}_{\gamma}$. These
updates are usually applied to the QR decomposition of $\mathbf{X}$. See e.g., (Miller 2002; Schniter et al.
2008) for details.


Greedy search 


Suppose we want to find the MAP model. If we use the $\ell_0$-regularized objective in Equation 13.27,
we can exploit properties of least squares to derive various efficient greedy forwards search
methods, some of which we summarize below. For further details, see (Miller 2002; Soussen
et al. 2010).


• Single best replacement The simplest method is to use greedy hill climbing, where at each
step, we define the neighborhood of the current model to be all models that can be reached
by flipping a single bit of $\boldsymbol{\gamma}$, i.e., for each variable, if it is currently out of the model, we
consider adding it, and if it is currently in the model, we consider removing it. In (Soussen
et al. 2010), they call this the single best replacement (SBR). Since we are expecting a
sparse solution, we can start with the empty set, $\boldsymbol{\gamma} = \mathbf{0}$. We are essentially moving through
the lattice of subsets, shown in Figure 13.2(a). We continue adding or removing until no
improvement is possible.

• Orthogonal least squares If we set $\lambda = 0$ in Equation 13.27, so there is no complexity
penalty, there will be no reason to perform deletion steps. In this case, the SBR algorithm is
equivalent to orthogonal least squares (Chen and Wigger 1995), which in turn is equivalent
to greedy forwards selection. In this algorithm, we start with the empty set and add the
best feature at each step. The error will go down monotonically with $\|\boldsymbol{\gamma}\|_0$, as shown in
Figure 13.2(b). We can pick the next best feature $j^*$ to add to the current set $\boldsymbol{\gamma}_t$ by solving

$$j^* = \arg\min_{j \notin \boldsymbol{\gamma}_t} \min_{\mathbf{w}} \|\mathbf{y} - (\mathbf{X}_{\boldsymbol{\gamma}_t \cup j}) \mathbf{w}\|^2 \tag{13.28}$$

We then update the active set by setting $\boldsymbol{\gamma}^{(t+1)} = \boldsymbol{\gamma}^{(t)} \cup \{j^*\}$. To choose the next feature to
add at step $t$, we need to solve $D - D_t$ least squares problems, where $D_t = |\boldsymbol{\gamma}_t|$ is
the cardinality of the current active set. Having chosen the best feature to add, we need to
solve an additional least squares problem to compute $\mathbf{w}_{t+1}$.

• Orthogonal matching pursuits Orthogonal least squares is somewhat expensive. A simplification
is to “freeze” the current weights at their current value, and then to pick the next
feature to add by solving

$$j^* = \arg\min_{j \notin \boldsymbol{\gamma}_t} \min_{\beta} \|\mathbf{y} - \mathbf{X}\mathbf{w}_t - \beta \mathbf{x}_{:,j}\|^2 \tag{13.29}$$

This inner optimization is easy to solve: we simply set $\beta = \mathbf{x}_{:,j}^T \mathbf{r}_t / \|\mathbf{x}_{:,j}\|^2$, where
$\mathbf{r}_t = \mathbf{y} - \mathbf{X}\mathbf{w}_t$ is the current residual vector. If the columns are unit norm, we have

$$j^* = \arg\max_j |\mathbf{x}_{:,j}^T \mathbf{r}_t| \tag{13.30}$$

so we are just looking for the column that is most correlated (in absolute value) with the current
residual. We then update the active set, and compute the new least squares estimate $\mathbf{w}_{t+1}$ using
$\mathbf{X}_{\boldsymbol{\gamma}_{t+1}}$. This method is called orthogonal matching pursuits or OMP (Mallat et al. 1994). This only
requires one least squares calculation per iteration and so is faster than orthogonal least
squares, but is not quite as accurate (Blumensath and Davies 2007).

• Matching pursuits An even more aggressive approximation is to just greedily add the feature
that is most correlated with the current residual. This is called matching pursuits (Mallat
and Zhang 1993). This is also equivalent to a method known as least squares boosting
(Section 16.4.6).

• Backwards selection Backwards selection starts with all variables in the model (the so-called
saturated model), and then deletes the worst one at each step. This is equivalent
to performing a greedy search from the top of the lattice downwards. This can give better
results than a bottom-up search, since the decision about whether to keep a variable or
not is made in the context of all the other variables that might depend on it. However,
this method is typically infeasible for large problems, since the saturated model will be too
expensive to fit.

• FoBa The forwards-backwards algorithm of (Zhang 2008) is similar to the single best
replacement algorithm presented above, except it uses an OMP-like approximation when
choosing the next move to make. A similar “dual-pass” algorithm was described in
(Moghaddam et al. 2008).

• Bayesian matching pursuit The algorithm of (Schniter et al. 2008) is similar to OMP except
it uses a Bayesian marginal likelihood scoring criterion (under a spike and slab model) instead
of a least squares objective. In addition, it uses a form of beam search to explore multiple
paths through the lattice at once.
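
As an illustration of the greedy methods listed above, here is a minimal sketch of orthogonal
matching pursuit. It assumes a fixed number of steps rather than a stopping rule, and it divides
the correlations by the column norms so that the selection rule also applies when the columns are
not unit norm. The function name `omp`, the argument `n_steps`, and the synthetic data are
illustrative assumptions.

```python
import numpy as np

def omp(X, y, n_steps):
    """Greedily add the column most correlated with the residual, then refit by least squares."""
    N, D = X.shape
    col_norms = np.linalg.norm(X, axis=0)
    active = []
    w = np.zeros(D)
    r = np.asarray(y, dtype=float).copy()      # current residual y - X w
    for _ in range(n_steps):
        scores = np.abs(X.T @ r) / col_norms   # |x_j^T r| / ||x_j||
        if active:
            scores[active] = -np.inf           # never reselect an active column
        j = int(np.argmax(scores))
        active.append(j)
        # One least squares problem per step, restricted to the active columns.
        w_active, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        w = np.zeros(D)
        w[active] = w_active
        r = y - X @ w
    return w, active

# Noise-free example with orthonormal columns (assumed for illustration).
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((30, 10)))
w_true = np.zeros(10)
w_true[[1, 4, 7]] = [3.0, -2.0, 1.0]
y = X @ w_true

w_hat, active = omp(X, y, n_steps=3)
print(active)
```

With orthonormal columns and no noise, the correlations equal the true weights, so the features
enter in order of decreasing magnitude. With correlated columns or noise this is no longer
guaranteed.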

Stochastic search


If we want to approximate the posterior, rather than just computing a mode (e.g. because we
want to compute marginal inclusion probabilities), one option is to use MCMC. The standard
approach is to use Metropolis Hastings, where the proposal distribution just flips single bits.
This enables us to efficiently compute $p(\boldsymbol{\gamma}'|\mathcal{D})$ given $p(\boldsymbol{\gamma}|\mathcal{D})$. The probability of a state (bit
configuration) is estimated by counting how many times the random walk visits this state. See
(O'Hara and Sillanpaa 2009) for a review of such methods, and (Bottolo and Richardson 2010)
for a very recent method based on evolutionary MCMC.

However, in a discrete state space, MCMC is needlessly inefficient, since we can compute the
(unnormalized) probability of a state directly using $p(\boldsymbol{\gamma}, \mathcal{D}) = \exp(-f(\boldsymbol{\gamma}))$; thus there is no
need to ever revisit a state. A much more efficient alternative is to use some kind of stochastic
search algorithm, to generate a set $\mathcal{S}$ of high scoring models, and then to make the following
approximation


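$$p(\boldsymbol{\gamma}|\mathcal{D}) \approx \frac{e^{-f(\boldsymbol{\gamma})}}{\sum_{\boldsymbol{\gamma}' \in \mathcal{S}} e^{-f(\boldsymbol{\gamma}')}} \tag{13.31}$$


See (Heaton and Scott 2009) for a review of recent methods of this kind.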
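
The approximation in Equation 13.31 amounts to a softmax over the negative costs of the visited
models. The sketch below is only an illustration: the list of supports, the cost values, and the
helper names are made up, and the log-sum-exp shift is a standard numerical precaution rather
than something the text prescribes.

```python
import numpy as np

def posterior_over_set(costs):
    """Normalize exp(-f(gamma)) over a finite set S of models (Equation 13.31)."""
    log_unnorm = -np.asarray(costs, dtype=float)
    shift = log_unnorm.max()                      # avoids underflow for large costs
    log_norm = shift + np.log(np.sum(np.exp(log_unnorm - shift)))
    return np.exp(log_unnorm - log_norm)

def inclusion_probabilities(models, probs, D):
    """Approximate p(gamma_j = 1 | D) by summing the probabilities of models that contain j."""
    incl = np.zeros(D)
    for support, p in zip(models, probs):
        for j in support:
            incl[j] += p
    return incl

# Hypothetical output of a search: supports and their costs f(gamma).
models = [(), (0,), (0, 2), (1,)]
costs = [1000.0, 1001.0, 1000.5, 1003.0]

probs = posterior_over_set(costs)
incl = inclusion_probabilities(models, probs, D=3)
print(probs, incl)
```

Because each model in $\mathcal{S}$ is scored once, no state needs to be revisited; the quality of
the approximation depends on how much posterior mass $\mathcal{S}$ captures.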


EM and variational inference * 


It is tempting to apply EM to the spike and slab model, which has the form $\gamma_j \rightarrow w_j \rightarrow \mathbf{y}$. We
can compute $p(\gamma_j = 1|w_j)$ in the E step, and optimize $\mathbf{w}$ in the M step. However, this will not
work, because when we compute $p(\gamma_j = 1|w_j)$, we are comparing a delta-function, $\delta_0(w_j)$, with
a Gaussian pdf, $\mathcal{N}(w_j|0, \sigma_w^2)$. We can replace the delta function with a narrow Gaussian, and
then the E step amounts to classifying $w_j$ under the two possible Gaussian models. However,
this is likely to suffer from severe local minima.

An alternative is to apply EM to the Bernoulli-Gaussian model, which has the form
$\gamma_j \rightarrow \mathbf{y} \leftarrow w_j$. In this case, the posterior $p(\boldsymbol{\gamma}|\mathcal{D}, \mathbf{w})$ is intractable to compute because all the
bits become correlated due to explaining away. However, it is possible to derive a mean field
approximation of the form $\prod_j q(\gamma_j) q(w_j)$ (Huang et al. 2007; Rattray et al. 2009).


$\ell_1$ regularization: basics


When we have many variables, it is computationally difficult to find the posterior mode of
$p(\boldsymbol{\gamma}|\mathcal{D})$. And although greedy algorithms often work well (see e.g., (Zhang 2008) for a theoretical
analysis), they can of course get stuck in local optima.

Figure 13.3 Illustration of $\ell_1$ (left) vs $\ell_2$ (right) regularization of a least squares problem. Based on
Figure 3.12 of (Hastie et al. 2001).

Part of the problem is due to the fact that the $\gamma_j$ variables are discrete, $\gamma_j \in \{0, 1\}$. In
the optimization community, it is common to relax hard constraints of this form by replacing
discrete variables with continuous variables. We can do this by replacing the spike-and-slab style
prior, which assigns finite probability mass to the event that $w_j = 0$, with continuous priors that
“encourage” $w_j = 0$ by putting a lot of probability density near the origin, such as a zero-mean
Laplace distribution. This was first introduced in Section 7.4 in the context of robust linear
regression. There we exploited the fact that the Laplace has heavy tails. Here we exploit the fact
that it has a spike near $\mu = 0$. More precisely, consider a prior of the form


$$p(\mathbf{w}|\lambda) = \prod_{j=1}^D \text{Lap}(w_j|0, 1/\lambda) \propto \prod_{j=1}^D e^{-\lambda |w_j|} \tag{13.32}$$


We will use a uniform prior on the offset term, $p(w_0) \propto 1$. Let us perform MAP estimation with
this prior. The penalized negative log likelihood has the form


$$f(\mathbf{w}) = -\log p(\mathcal{D}|\mathbf{w}) - \log p(\mathbf{w}|\lambda) = \text{NLL}(\mathbf{w}) + \lambda \|\mathbf{w}\|_1 \tag{13.33}$$


where $\|\mathbf{w}\|_1 = \sum_{j=1}^D |w_j|$ is the $\ell_1$ norm of $\mathbf{w}$. For suitably large $\lambda$, the estimate $\hat{\mathbf{w}}$ will be
sparse, for reasons we explain below. Indeed, this can be thought of as a convex approximation
to the non-convex $\ell_0$ objective


$$\arg\min_{\mathbf{w}} \text{NLL}(\mathbf{w}) + \lambda \|\mathbf{w}\|_0 \tag{13.34}$$


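In the case of linear regression, the $\ell_1$ objective becomes


$$f(\mathbf{w}) = \sum_{i=1}^N \frac{1}{2\sigma^2} (y_i - (w_0 + \mathbf{w}^T \mathbf{x}_i))^2 + \lambda \|\mathbf{w}\|_1 \tag{13.35}$$
$$\propto \text{RSS}(\mathbf{w}) + \lambda' \|\mathbf{w}\|_1 \tag{13.36}$$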


where $\lambda' = 2\lambda\sigma^2$. This method is known as basis pursuit denoising or BPDN (Chen et al. 1998).
The reason for this term will become clear later. In general, the technique of putting a zero-mean
Laplace prior on the parameters and performing MAP estimation is called $\ell_1$ regularization.
It can be combined with any convex or non-convex NLL term. Many different algorithms have
been devised for solving such problems, some of which we review in Section 13.4.


Why does $\ell_1$ regularization yield sparse solutions?


We now explain why $\ell_1$ regularization results in sparse solutions, whereas $\ell_2$ regularization does
not. We focus on the case of linear regression, although similar arguments hold for logistic
regression and other GLMs.

The objective is the following non-smooth objective function:


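$$\min_{\mathbf{w}} \text{RSS}(\mathbf{w}) + \lambda \|\mathbf{w}\|_1 \tag{13.37}$$

We can rewrite this as a constrained but smooth objective (a quadratic function with linear
constraints):

$$\min_{\mathbf{w}} \text{RSS}(\mathbf{w}) \quad \text{s.t.} \quad \|\mathbf{w}\|_1 \leq B \tag{13.38}$$
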

where $B$ is an upper bound on the $\ell_1$-norm of the weights: a small (tight) bound $B$ corresponds
to a large penalty $\lambda$, and vice versa.² Equation 13.38 is known as lasso, which stands for “least
absolute shrinkage and selection operator” (Tibshirani 1996). We will see why it has this name
later.

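Similarly, we can write ridge regression as


$$\min_{\mathbf{w}} \text{RSS}(\mathbf{w}) + \lambda \|\mathbf{w}\|_2^2 \tag{13.39}$$


or in bound constrained form:

$$\min_{\mathbf{w}} \text{RSS}(\mathbf{w}) \quad \text{s.t.} \quad \|\mathbf{w}\|_2^2 \leq B \tag{13.40}$$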


In Figure 13.3, we plot the contours of the RSS objective function, as well as the contours of
the $\ell_2$ and $\ell_1$ constraint surfaces. From the theory of constrained optimization, we know that
the optimal solution occurs at the point where the lowest level set of the objective function
intersects the constraint surface (assuming the constraint is active). It should be geometrically
clear that as we relax the constraint $B$, we “grow” the $\ell_1$ “ball” until it meets the objective; the
corners of the ball are more likely to intersect the ellipse than one of the sides, especially in high
dimensions, because the corners “stick out” more. The corners correspond to sparse solutions,
which lie on the coordinate axes. By contrast, when we grow the $\ell_2$ ball, it can intersect the
objective at any point; there are no “corners”, so there is no preference for sparsity.

To see this another way, notice that, with ridge regression, the prior cost of a sparse solution,
such as $\mathbf{w} = (1, 0)$, is the same as the cost of a dense solution, such as $\mathbf{w} = (1/\sqrt{2}, 1/\sqrt{2})$,
as long as they have the same $\ell_2$ norm:

$$\|(1, 0)\|_2 = \|(1/\sqrt{2}, 1/\sqrt{2})\|_2 = 1 \tag{13.41}$$

However, for lasso, setting $\mathbf{w} = (1, 0)$ is cheaper than setting $\mathbf{w} = (1/\sqrt{2}, 1/\sqrt{2})$, since

$$\|(1, 0)\|_1 = 1 < \|(1/\sqrt{2}, 1/\sqrt{2})\|_1 = \sqrt{2} \tag{13.42}$$

The most rigorous way to see that $\ell_1$ regularization results in sparse solutions is to examine
conditions that hold at the optimum. We do this in Section 13.3.2.
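
The comparison of the two penalties on a sparse and a dense vector of equal $\ell_2$ norm can be
checked numerically. The sketch below also repeats the comparison in $D = 100$ dimensions, where
the gap grows to $\sqrt{D}$; the helper name `norms` and the choice of $D$ are illustrative
assumptions.

```python
import numpy as np

def norms(w):
    """Return the (l1, l2) norms of a vector."""
    return np.linalg.norm(w, 1), np.linalg.norm(w, 2)

w_sparse = np.array([1.0, 0.0])
w_dense = np.array([1.0, 1.0]) / np.sqrt(2.0)
print("sparse (l1, l2):", norms(w_sparse))
print("dense  (l1, l2):", norms(w_dense))

# Same comparison in higher dimensions: a coordinate vector versus a constant vector,
# both scaled to unit l2 norm.
D = 100
e1 = np.zeros(D)
e1[0] = 1.0
u = np.ones(D) / np.sqrt(D)
l1_ratio = norms(u)[0] / norms(e1)[0]   # how much more the l1 penalty charges the dense vector
print("l1 ratio in D = 100:", l1_ratio)
```

The ridge penalty cannot distinguish the two vectors, whereas the $\ell_1$ penalty charges the
dense one more, and increasingly so as the dimension grows.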

Optimality conditions for lasso

The lasso objective has the form


$$f(\mathbf{w}) = \text{RSS}(\mathbf{w}) + \lambda \|\mathbf{w}\|_1 \tag{13.43}$$


2. Equation 13.38 is an example of a quadratic program or QP, since we have a quadratic objective subject to linear 
inequality constraints. Its Lagrangian is given by Equation 13.37. 
Figure 13.4 Illustration of some sub-derivatives of a function at point $x_0$. Based on a figure at
http://en.wikipedia.org/wiki/Subderivative. Figure generated by subgradientPlot.


Unfortunately, the $\|\mathbf{w}\|_1$ term is not differentiable whenever $w_j = 0$. This is an example of a
non-smooth optimization problem.

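To handle non-smooth functions, we need to extend the notion of a derivative. We define a
subderivative or subgradient of a (convex) function $f : \mathcal{I} \rightarrow \mathbb{R}$ at a point $\theta_0$ to be a scalar $g$
such that


$$f(\theta) - f(\theta_0) \geq g(\theta - \theta_0) \quad \forall \theta \in \mathcal{I} \tag{13.44}$$


where $\mathcal{I}$ is some interval containing $\theta_0$. See Figure 13.4 for an illustration.³ We define the set of
subderivatives as the interval $[a, b]$ where $a$ and $b$ are the one-sided limits


$$a = \lim_{\theta \rightarrow \theta_0^-} \frac{f(\theta) - f(\theta_0)}{\theta - \theta_0}, \quad b = \lim_{\theta \rightarrow \theta_0^+} \frac{f(\theta) - f(\theta_0)}{\theta - \theta_0} \tag{13.46}$$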


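The set $[a, b]$ of all subderivatives is called the subdifferential of the function $f$ at $\theta_0$ and
is denoted $\partial f(\theta)|_{\theta_0}$. For example, in the case of the absolute value function $f(\theta) = |\theta|$, the
subderivative is given by


$$\partial f(\theta) = \begin{cases} \{-1\} & \text{if } \theta < 0 \\ [-1, 1] & \text{if } \theta = 0 \\ \{+1\} & \text{if } \theta > 0 \end{cases} \tag{13.47}$$


If the function is everywhere differentiable, then $\partial f(\theta) = \{\frac{df(\theta)}{d\theta}\}$. By analogy to the standard
calculus result, one can show that the point $\hat{\theta}$ is a local minimum of $f$ iff $0 \in \partial f(\theta)|_{\hat{\theta}}$.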


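3. In general, for a vector valued function, we say that $\mathbf{g}$ is a subgradient of $f$ at $\boldsymbol{\theta}_0$ if for all vectors $\boldsymbol{\theta}$,
$$f(\boldsymbol{\theta}) - f(\boldsymbol{\theta}_0) \geq (\boldsymbol{\theta} - \boldsymbol{\theta}_0)^T \mathbf{g} \tag{13.45}$$
so $\mathbf{g}$ is a linear lower bound to the function at $\boldsymbol{\theta}_0$.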
Figure 13.5 Left: soft thresholding. The flat region is the interval $[-\lambda, +\lambda]$. Right: hard thresholding.


Let us apply these concepts to the lasso problem. Let us initially ignore the non-smooth 
penalty term. One can show (Exercise 13.1) that 


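$$\frac{\partial}{\partial w_j} \text{RSS}(\mathbf{w}) = a_j w_j - c_j \tag{13.48}$$
$$a_j = 2 \sum_{i=1}^N x_{ij}^2 \tag{13.49}$$
$$c_j = 2 \sum_{i=1}^N x_{ij} (y_i - \mathbf{w}_{-j}^T \mathbf{x}_{i,-j}) \tag{13.50}$$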


where $\mathbf{w}_{-j}$ is $\mathbf{w}$ without component $j$, and similarly for $\mathbf{x}_{i,-j}$. We see that $c_j$ is (proportional
to) the correlation between the $j$'th feature $\mathbf{x}_{:,j}$ and the residual due to the other features,
$\mathbf{r}_{-j} = \mathbf{y} - \mathbf{X}_{:,-j} \mathbf{w}_{-j}$. Hence the magnitude of $c_j$ is an indication of how relevant feature $j$ is
for predicting $\mathbf{y}$ (relative to the other features and the current parameters).

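Adding in the penalty term, we find that the subderivative is given by


$$\partial_{w_j} f(\mathbf{w}) = (a_j w_j - c_j) + \lambda \partial_{w_j} \|\mathbf{w}\|_1 \tag{13.51}$$
$$= \begin{cases} \{a_j w_j - c_j - \lambda\} & \text{if } w_j < 0 \\ [-c_j - \lambda, -c_j + \lambda] & \text{if } w_j = 0 \\ \{a_j w_j - c_j + \lambda\} & \text{if } w_j > 0 \end{cases} \tag{13.52}$$

Requiring that $0 \in \partial_{w_j} f(\mathbf{w})$, and noting that $c_j - a_j w_j = 2[\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w})]_j$, we can write the
optimality condition in a more compact fashion as follows:


$$2[\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w})]_j \in \begin{cases} \{-\lambda\} & \text{if } w_j < 0 \\ [-\lambda, \lambda] & \text{if } w_j = 0 \\ \{\lambda\} & \text{if } w_j > 0 \end{cases} \tag{13.53}$$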


Depending on the value of $c_j$, the solution to $\partial_{w_j} f(\mathbf{w}) = 0$ can occur at 3 different values
of $w_j$, as follows:

1. If $c_j < -\lambda$, so the feature is strongly negatively correlated with the residual, then the
subgradient is zero at $\hat{w}_j = \frac{c_j + \lambda}{a_j} < 0$.

2. If $c_j \in [-\lambda, \lambda]$, so the feature is only weakly correlated with the residual, then the subgradient
is zero at $\hat{w}_j = 0$.

3. If $c_j > \lambda$, so the feature is strongly positively correlated with the residual, then the
subgradient is zero at $\hat{w}_j = \frac{c_j - \lambda}{a_j} > 0$.

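In summary, we have


$$\hat{w}_j(c_j) = \begin{cases} (c_j + \lambda)/a_j & \text{if } c_j < -\lambda \\ 0 & \text{if } c_j \in [-\lambda, \lambda] \\ (c_j - \lambda)/a_j & \text{if } c_j > \lambda \end{cases} \tag{13.54}$$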


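We can write this as follows:


$$\hat{w}_j = \text{soft}\left(\frac{c_j}{a_j}; \frac{\lambda}{a_j}\right) \tag{13.55}$$
where
$$\text{soft}(a; \delta) \triangleq \text{sign}(a) (|a| - \delta)_+ \tag{13.56}$$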


and $x_+ = \max(x, 0)$ is the positive part of $x$. This is called soft thresholding. This is
illustrated in Figure 13.5(a), where we plot $\hat{w}_j$ vs $c_j$. The dotted line is the line $w_j = c_j / a_j$
corresponding to the least squares fit. The solid line, which represents the regularized estimate
$\hat{w}_j(c_j)$, shifts the dotted line down (or up) by $\lambda$, except when $-\lambda \leq c_j \leq \lambda$, in which case it
sets $w_j = 0$.

By contrast, in Figure 13.5(b), we illustrate hard thresholding. This sets values of $w_j$ to
0 if $-\lambda \leq c_j \leq \lambda$, but it does not shrink the values of $w_j$ outside of this interval. The
soft thresholding line is offset from the diagonal, which means that even
large coefficients are shrunk towards zero; consequently lasso is a biased estimator. This is
undesirable, since if the likelihood indicates (via $c_j$) that the coefficient $w_j$ should be large, we
do not want to shrink it. We will discuss this issue in more detail in Section 13.6.2.
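
The two thresholding rules can be written in a few lines. This is a minimal sketch; the function
names `soft` and `hard` and the test values are illustrative, and the threshold `delta` plays the
role of $\lambda / a_j$ when the input is $c_j / a_j$.

```python
import numpy as np

def soft(a, delta):
    """Soft thresholding (Equation 13.56): shrink towards zero by delta, clip at zero."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def hard(a, delta):
    """Hard thresholding: zero out small values, leave the others unshrunk."""
    return np.where(np.abs(a) > delta, a, 0.0)

a = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print("soft:", soft(a, 1.0))
print("hard:", hard(a, 1.0))
```

Both rules zero out the inputs inside $[-\delta, \delta]$; only the soft rule also moves the
surviving values towards zero, which is the source of the bias discussed above.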

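Now we finally can understand why Tibshirani invented the term “lasso” in (Tibshirani 1996):
it stands for “least absolute shrinkage and selection operator”, since it selects a subset of the
variables, and shrinks all the coefficients by penalizing the absolute values. If $\lambda = 0$, we get the
OLS solution (of minimal $\ell_1$ norm). If $\lambda \geq \lambda_{max}$, we get $\hat{\mathbf{w}} = \mathbf{0}$, where

$$\lambda_{max} = 2\|\mathbf{X}^T \mathbf{y}\|_\infty = 2 \max_j |\mathbf{y}^T \mathbf{x}_{:,j}| \tag{13.57}$$

This value is computed using the fact that $\mathbf{0}$ is optimal if $c_j \in [-\lambda, \lambda]$ for all $j$, where
$c_j = 2(\mathbf{X}^T \mathbf{y})_j$ when $\mathbf{w} = \mathbf{0}$ (Equation 13.50). In general,
the maximum penalty for an $\ell_1$ regularized objective is

$$\lambda_{max} = \max_j |\nabla_j \text{NLL}(\mathbf{0})| \tag{13.58}$$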
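
To make the role of $\lambda_{max}$ concrete, the sketch below cycles the soft thresholding update of
Equation 13.55 over the coordinates and checks the solution just above and well below the
threshold. It assumes the objective $\text{RSS}(\mathbf{w}) + \lambda\|\mathbf{w}\|_1$ of Equation 13.43 with
$a_j$ and $c_j$ as in Equations 13.49 and 13.50, for which $\mathbf{w} = \mathbf{0}$ is a fixed point exactly
when $\lambda \geq 2\|\mathbf{X}^T\mathbf{y}\|_\infty$; if the loss is scaled as $\frac{1}{2}\text{RSS}$ instead, the factor
of 2 disappears. The function names, the number of sweeps, and the data are illustrative
assumptions, and there is no offset term.

```python
import numpy as np

def soft(a, delta):
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimize RSS(w) + lam * ||w||_1 by cycling the update of Equation 13.55."""
    N, D = X.shape
    w = np.zeros(D)
    a = 2.0 * np.sum(X ** 2, axis=0)                  # Equation 13.49
    for _ in range(n_sweeps):
        for j in range(D):
            r_minus_j = y - X @ w + X[:, j] * w[j]    # residual that ignores feature j
            c_j = 2.0 * X[:, j] @ r_minus_j           # Equation 13.50
            w[j] = soft(c_j / a[j], lam / a[j])
    return w

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(40)

lam_max = 2.0 * np.max(np.abs(X.T @ y))
w_at_max = lasso_cd(X, y, 1.001 * lam_max)    # slightly above the threshold
w_below = lasso_cd(X, y, 0.5 * lam_max)       # well below the threshold
print(w_at_max, w_below)
```

Starting from zero, every update above the threshold returns zero again, so the weights never
leave the origin; below the threshold at least one coordinate becomes active.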

Comparison of least squares, lasso, ridge and subset selection


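We can gain further insight into $\ell_1$ regularization by comparing it to least squares, and $\ell_2$ and
$\ell_0$ regularized least squares. For simplicity, assume all the features of $\mathbf{X}$ are orthonormal, so
$\mathbf{X}^T \mathbf{X} = \mathbf{I}$. In this case, the RSS is given by


$$\text{RSS}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = \mathbf{y}^T \mathbf{y} + \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w} - 2 \mathbf{w}^T \mathbf{X}^T \mathbf{y} \tag{13.59}$$
$$= \text{const} + \sum_k w_k^2 - 2 \sum_k \sum_i w_k x_{ik} y_i \tag{13.60}$$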


so we see this factorizes into a sum of terms, one per dimension. Hence we can write down the 
MAP and ML estimates analytically, as follows: 


• MLE The OLS solution is given by

$$\hat{w}_k^{OLS} = \mathbf{x}_{:k}^T \mathbf{y} \tag{13.61}$$

where $\mathbf{x}_{:k}$ is the $k$'th column of $\mathbf{X}$. This follows trivially from Equation 13.60. We see
that $\hat{w}_k^{OLS}$ is just the orthogonal projection of feature $k$ onto the response vector (see
Section 7.3.2).

• Ridge One can show that the ridge estimate is given by

$$\hat{w}_k^{ridge} = \frac{\hat{w}_k^{OLS}}{1 + \lambda} \tag{13.62}$$

• Lasso From Equation 13.55, and using the fact that $a_k = 2$ and $\hat{w}_k^{OLS} = c_k / 2$, we have

$$\hat{w}_k^{lasso} = \text{sign}(\hat{w}_k^{OLS}) \left(|\hat{w}_k^{OLS}| - \frac{\lambda}{2}\right)_+ \tag{13.63}$$

This corresponds to soft thresholding, shown in Figure 13.5(a).

• Subset selection If we pick the best $K$ features using subset selection, the parameter
estimate is as follows

$$\hat{w}_k^{SS} = \begin{cases} \hat{w}_k^{OLS} & \text{if } \text{rank}(|\hat{w}_k^{OLS}|) \leq K \\ 0 & \text{otherwise} \end{cases} \tag{13.64}$$

where rank refers to the location in the sorted list of weight magnitudes. This corresponds
to hard thresholding, shown in Figure 13.5(b).
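
The four closed-form estimates can be compared side by side on a design with orthonormal
columns. The sketch below builds such a design with a QR decomposition and uses a noise-free
response so that the numbers are easy to read; the sizes, the weights, $\lambda = 1$ and $K = 2$ are
arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam, K = 20, 4, 1.0, 2

# Orthonormal columns: X.T @ X is the identity (up to rounding).
X, _ = np.linalg.qr(rng.standard_normal((N, D)))
y = X @ np.array([3.0, -0.2, 1.0, 0.4])          # noise-free response (assumption)

w_ols = X.T @ y                                                        # Equation 13.61
w_ridge = w_ols / (1.0 + lam)                                          # Equation 13.62
w_lasso = np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam / 2.0, 0.0)  # Equation 13.63

# Subset selection keeps the K largest OLS weights in magnitude (Equation 13.64).
keep = np.argsort(-np.abs(w_ols))[:K]
w_ss = np.zeros(D)
w_ss[keep] = w_ols[keep]

for name, w in [("OLS", w_ols), ("ridge", w_ridge), ("lasso", w_lasso), ("subset", w_ss)]:
    print(f"{name:>6}: {np.round(w, 3)}")
```

Ridge rescales every weight by the same factor, lasso subtracts $\lambda/2$ from every magnitude and
clips at zero, and subset selection keeps the largest weights untouched.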


Figure 13.6(a) plots the MSE vs $\lambda$ for lasso for a degree 14 polynomial, and Figure 13.6(b) plots
the MSE vs polynomial order. We see that lasso gives similar results to the subset selection
method.

As another example, consider a data set concerning prostate cancer. We have D = 8 features 
and N = 67 training cases; the goal is to predict the log prostate-specific antigen levels (see 
(Hastie et al. 2009, p4) for more biological details). Table 13.1 shows that lasso gives better 
prediction accuracy (at least on this particular data set) than least squares, ridge, and best 
subset regression. (In each case, the strength of the regularizer was chosen by cross validation.) 
Lasso also gives rise to a sparse solution. Of course, for other problems, ridge may give better 
predictive accuracy. In practice, a combination of lasso and ridge, known as the elastic net, 
often performs best, since it provides a good combination of sparsity and regularization (see 
Section 13.5.3). 
Figure 13.6 (a) MSE vs $\lambda$ for lasso for a degree 14 polynomial. Note that $\lambda$ decreases as we move to
the right. Figure generated by linregPolyLassoDemo. (b) MSE versus polynomial degree. Note that the
model order increases as we move to the right. See Figure 1.18 for a plot of some of these polynomial
regression models. Figure generated by linregPolyVsDegree.


Term         LS       Best Subset   Ridge    Lasso
Intercept    2.452    2.481         2.479    2.480
lcavol       0.716    0.651         0.656    0.653
lweight      0.293    0.380         0.300    0.297
age         -0.143   -0.000        -0.129   -0.119
lbph         0.212   -0.000         0.208    0.200
svi          0.310   -0.000         0.301    0.289
lcp         -0.289   -0.000        -0.260   -0.236
gleason     -0.021   -0.000        -0.019    0.000
pgg45        0.277    0.178         0.256    0.226
Test Error   0.586    0.572         0.580    0.564


Table 13.1 Results of different methods on the prostate cancer data, which has 8 features and 67 training
cases. Methods are: LS = least squares, Subset = best subset regression, Ridge, Lasso. Rows represent
the coefficients; we see that subset regression and lasso give sparse solutions. Bottom row is the mean
squared error on the test set (30 cases). Based on Table 3.3 of (Hastie et al. 2009). Table generated by
prostateComparison.


Regularization path 


As we increase $\lambda$, the solution vector $\hat{\mathbf{w}}(\lambda)$ will tend to get sparser, although not necessarily
monotonically. We can plot the values $\hat{w}_j(\lambda)$ vs $\lambda$ for each feature $j$; this is known as the
regularization path.

This is illustrated for ridge regression in Figure 13.7(a), where we plot $\hat{w}_j(\lambda)$ as the regularizer
$\lambda$ decreases. We see that when $\lambda = \infty$, all the coefficients are zero. But for any finite value of
$\lambda$, all coefficients are non-zero; furthermore, they increase in magnitude as $\lambda$ is decreased.

In Figure 13.7(b), we plot the analogous result for lasso. As we move to the right, the upper
bound on the $\ell_1$ penalty, $B$, increases. When $B = 0$, all the coefficients are zero. As we increase
$B$, the coefficients gradually “turn on”. But for any value between 0 and $B_{max} = \|\hat{\mathbf{w}}_{OLS}\|_1$,
the solution is sparse.⁴

Figure 13.7 (a) Profiles of ridge coefficients for the prostate cancer example vs bound on $\ell_2$ norm of $\mathbf{w}$,
so small $t$ (large $\lambda$) is on the left. The vertical line is the value chosen by 5-fold CV using the 1SE rule.
Based on Figure 3.8 of (Hastie et al. 2009). Figure generated by ridgePathProstate. (b) Profiles of lasso
coefficients for the prostate cancer example vs bound on $\ell_1$ norm of $\mathbf{w}$, so small $t$ (large $\lambda$) is on the left.
Based on Figure 3.10 of (Hastie et al. 2009). Figure generated by lassoPathProstate.

Figure 13.8 Illustration of piecewise linearity of regularization path for lasso on the prostate cancer
example. (a) We plot $\hat{w}_j(B)$ vs $B$ for the critical values of $B$. (b) We plot vs steps of the LARS algorithm.
Figure generated by lassoPathProstate.

Remarkably, it can be shown that the solution path is a piecewise linear function of $B$ (Efron
et al. 2004). That is, there are a set of critical values of $B$ where the active set of non-zero
coefficients changes. For values of $B$ between these critical values, each non-zero coefficient
increases or decreases in a linear fashion. This is illustrated in Figure 13.8(a). Furthermore,
one can solve for these critical values analytically. This is the basis of the LARS algorithm
(Efron et al. 2004), which stands for “least angle regression and shrinkage” (see Section 13.4.2
for details). Remarkably, LARS can compute the entire regularization path for roughly the same
computational cost as a single least squares fit (namely $O(\min(ND^2, DN^2))$).


4. It is common to plot the solution versus the shrinkage factor, defined as $s(B) = B/B_{max}$, rather than against $B$.
This merely affects the scale of the horizontal axis, not the shape of the curves.

Figure 13.9 Example of recovering a sparse signal using lasso. See text for details. Based on Figure 1 of
(Figueiredo et al. 2007). Figure generated by sparseSensingDemo, written by Mario Figueiredo.

In Figure 13.8(b), we plot the coefficients computed at each critical value of B. Now the 
piecewise linearity is more evident. Below we display the actual coefficient values at each step 
along the regularization path (the last line is the least squares solution): 


Listing 13.1 Output of lassoPathProstate


0        0        0        0        0        0        0        0
0.4279   0        0        0        0        0        0        0
0.5015   0.0735   0        0        0        0        0        0
0.5610   0.1878   0        0        0.0930   0        0        0
0.5622   0.1890   0        0.0036   0.0963   0        0        0
0.5797   0.2456   0        0.1435   0.2003   0        0        0.0901
0.5864   0.2572  -0.0321   0.1639   0.2082   0        0        0.1066
0.6994   0.2910  -0.1337   0.2062   0.3003  -0.2565   0        0.2452
0.7164   0.2926  -0.1425   0.2120   0.3096  -0.2890  -0.0209   0.2773


By changing $B$ from 0 to $B_{max}$, we can go from a solution in which all the weights are zero
to a solution in which all weights are non-zero. Unfortunately, not all subset sizes are achievable
using lasso. One can show that, if $D > N$, the optimal solution can have at most $N$ variables in
it, before reaching the complete set corresponding to the OLS solution of minimal $\ell_1$ norm. In
Section 13.5.3, we will see that by using an $\ell_2$ regularizer as well as an $\ell_1$ regularizer (a method
known as the elastic net), we can achieve sparse solutions which contain more variables than
training cases. This lets us explore model sizes between $N$ and $D$.
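
A regularization path can also be traced on a grid of penalties, which is simpler than LARS but
does not exploit the piecewise linearity. The sketch below is written in terms of $\lambda$ rather
than the bound $B$, assumes the objective $\text{RSS}(\mathbf{w}) + \lambda\|\mathbf{w}\|_1$ without an offset, and
warm-starts each coordinate descent run from the previous solution. The function names, the
grid, and the synthetic data are illustrative assumptions and are unrelated to the prostate
example.

```python
import numpy as np

def soft(a, delta):
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def lasso_cd(X, y, lam, w_init, n_sweeps=200):
    """Coordinate descent for RSS(w) + lam * ||w||_1, started from w_init."""
    w = w_init.copy()
    r = y - X @ w                                  # residual, updated incrementally
    a = 2.0 * np.sum(X ** 2, axis=0)
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            c_j = 2.0 * X[:, j] @ r + a[j] * w[j]  # equals 2 x_j^T (r + x_j w_j)
            w_new = soft(c_j / a[j], lam / a[j])
            r += X[:, j] * (w[j] - w_new)
            w[j] = w_new
    return w

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.standard_normal(60)

lam_max = 2.0 * np.max(np.abs(X.T @ y))
lams = lam_max * np.array([1.001, 0.5, 0.1, 0.01, 1e-8])   # decreasing penalties

path = []
w = np.zeros(X.shape[1])
for lam in lams:
    w = lasso_cd(X, y, lam, w)      # warm start from the previous solution
    path.append(w.copy())
path = np.array(path)

n_nonzero = (path != 0).sum(axis=1)
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(n_nonzero)
```

At the largest penalty all weights are zero, and as the penalty approaches zero the solution
approaches the least squares fit, matching the two ends of the path described above.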

Model selection


It is tempting to use $\ell_1$ regularization to estimate the set of relevant variables. In some cases,
we can recover the true sparsity pattern of $\mathbf{w}^*$, the parameter vector that generated the data.
A method that can recover the true model in the $N \rightarrow \infty$ limit is called model selection
consistent. The details on which methods enjoy this property, and when, are beyond the scope
of this book; see e.g., (Buhlmann and van de Geer 2011) for details.

Instead of going into a theoretical discussion, we will just show a small example. We first
generate a sparse signal $\mathbf{w}^*$ of size $D = 4096$, consisting of 160 randomly placed $\pm 1$ spikes.
Next we generate a random design matrix $\mathbf{X}$ of size $N \times D$, where $N = 1024$. Finally we
generate a noisy observation $\mathbf{y} = \mathbf{X}\mathbf{w}^* + \boldsymbol{\epsilon}$, where $\epsilon_i \sim \mathcal{N}(0, 0.01^2)$. We then estimate $\mathbf{w}$ from
$\mathbf{y}$ and $\mathbf{X}$.

The original $\mathbf{w}^*$ is shown in the first row of Figure 13.9. The second row is the $\ell_1$ estimate
$\hat{\mathbf{w}}_{L1}$ using $\lambda = 0.1\lambda_{max}$. We see that this has “spikes” in the right places, but they are too
small. The third row is the least squares estimate of the coefficients which are estimated to be
non-zero based on $\text{supp}(\hat{\mathbf{w}}_{L1})$. This is called debiasing, and is necessary because lasso shrinks
the relevant coefficients as well as the irrelevant ones. The last row is the least squares estimate
for all the coefficients jointly, ignoring sparsity. We see that the (debiased) sparse estimate
is an excellent estimate of the original signal. By contrast, least squares without the sparsity
assumption performs very poorly.
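
The debiasing step can be sketched on a much smaller problem than the one in Figure 13.9. The
code below is not the sparseSensingDemo script; it assumes a Gaussian design with $N = 64$,
$D = 128$ and 5 spikes, solves the lasso with a simple coordinate descent loop for the objective
$\text{RSS}(\mathbf{w}) + \lambda\|\mathbf{w}\|_1$, and then refits least squares on the estimated support. All
names and sizes are illustrative assumptions.

```python
import numpy as np

def soft(a, delta):
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    """Coordinate descent for RSS(w) + lam * ||w||_1, started from zero."""
    N, D = X.shape
    w = np.zeros(D)
    r = y.copy()                                   # residual y - X w
    a = 2.0 * np.sum(X ** 2, axis=0)
    for _ in range(n_sweeps):
        for j in range(D):
            c_j = 2.0 * X[:, j] @ r + a[j] * w[j]
            w_new = soft(c_j / a[j], lam / a[j])
            r += X[:, j] * (w[j] - w_new)
            w[j] = w_new
    return w

rng = np.random.default_rng(0)
N, D, n_spikes = 64, 128, 5
w_true = np.zeros(D)
spikes = rng.choice(D, size=n_spikes, replace=False)
w_true[spikes] = rng.choice([-1.0, 1.0], size=n_spikes)
X = rng.standard_normal((N, D)) / np.sqrt(N)
y = X @ w_true + 0.01 * rng.standard_normal(N)

lam = 0.1 * 2.0 * np.max(np.abs(X.T @ y))          # 0.1 * lambda_max for this scaling
w_l1 = lasso_cd(X, y, lam)

# Debiasing: ordinary least squares restricted to the support chosen by lasso.
support = np.flatnonzero(w_l1)
w_debiased = np.zeros(D)
w_ls, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
w_debiased[support] = w_ls

def rss(w):
    return float(np.sum((y - X @ w) ** 2))

print(len(support), rss(w_l1), rss(w_debiased))
```

The refit cannot increase the training RSS, since the lasso solution is itself one of the
candidates supported on the same set; whether it also recovers the true spikes depends on the
support being correct.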

Of course, to perform model selection, we have to pick $\lambda$. It is common to use cross validation.
However, it is important to note that cross validation is picking a value of $\lambda$ that results in good
predictive accuracy. This is not usually the same value as the one that is likely to recover the
“true” model. To see why, recall that $\ell_1$ regularization performs selection and shrinkage, that is,
the chosen coefficients are brought closer to 0. In order to prevent relevant coefficients from
being shrunk in this way, cross validation will tend to pick a value of $\lambda$ that is not too large. Of
course, this will result in a less sparse model which contains irrelevant variables (false positives).
Indeed, it was proved in (Meinshausen and Buhlmann 2006) that the prediction-optimal value
of $\lambda$ does not result in model selection consistency. In Section 13.6.2, we will discuss some
adaptive mechanisms for automatically tuning $\lambda$ on a per-dimension basis that do result in
model selection consistency.

A downside of using $\ell_1$ regularization to select variables is that it can give quite different
results if the data is perturbed slightly. The Bayesian approach, which estimates posterior
marginal inclusion probabilities, $p(\gamma_j = 1|\mathcal{D})$, is much more robust. A frequentist solution to
this is to use bootstrap resampling (see Section 6.2.1), and to rerun the estimator on different
versions of the data. By computing how often each variable is selected across different trials,
we can approximate the posterior inclusion probabilities. This method is known as stability
selection (Meinshausen and Buhlmann 2010).

We can threshold the stability selection (bootstrap) inclusion probabilities at some level, say
90%, and thus derive a sparse estimator. This is known as bootstrap lasso or bolasso (Bach
2008). It will include a variable if it occurs in at least 90% of sets returned by lasso (for a fixed
$\lambda$). This process of intersecting the sets is a way of eliminating the false positives that vanilla
lasso produces. The theoretical results in (Bach 2008) prove that bolasso is model selection
consistent under a wider range of conditions than vanilla lasso.
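
The bootstrap procedure described above can be sketched as follows. This is not the bolassoDemo
script: it assumes a simple coordinate descent lasso for the objective
$\text{RSS}(\mathbf{w}) + \lambda\|\mathbf{w}\|_1$, a fixed penalty, 32 bootstrap replications, and a 90% threshold;
the function names, the data, and the value of `lam` are illustrative choices.

```python
import numpy as np

def soft(a, delta):
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def lasso_cd(X, y, lam, n_sweeps=50):
    """Coordinate descent for RSS(w) + lam * ||w||_1, started from zero."""
    N, D = X.shape
    w = np.zeros(D)
    r = y.copy()
    a = 2.0 * np.sum(X ** 2, axis=0)
    for _ in range(n_sweeps):
        for j in range(D):
            c_j = 2.0 * X[:, j] @ r + a[j] * w[j]
            w_new = soft(c_j / a[j], lam / a[j])
            r += X[:, j] * (w[j] - w_new)
            w[j] = w_new
    return w

def selection_frequencies(X, y, lam, n_boot, rng):
    """Fraction of bootstrap replicates in which lasso selects each variable."""
    N, D = X.shape
    counts = np.zeros(D)
    for _ in range(n_boot):
        idx = rng.integers(0, N, size=N)           # resample rows with replacement
        w = lasso_cd(X[idx], y[idx], lam)
        counts += (w != 0)
    return counts / n_boot

rng = np.random.default_rng(0)
N, D = 100, 6
X = rng.standard_normal((N, D))
w_true = np.array([3.0, 0.0, -3.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.standard_normal(N)

freq = selection_frequencies(X, y, lam=20.0, n_boot=32, rng=rng)
bolasso_support = np.flatnonzero(freq >= 0.9)      # keep variables selected in >= 90% of runs
print(np.round(freq, 2), bolasso_support)
```

The vector `freq` plays the role of the approximate inclusion probabilities, and thresholding it
gives the bolasso support for this value of the penalty.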

As an illustration, we reproduced the experiments in (Bach 2008). In particular, we created
256 datasets of size $N = 1000$ with $D = 16$ variables, of which 8 are relevant. See (Bach 2008)
for more detail on the experimental setup. For dataset $n$, variable $j$, and sparsity level $k$, define
$S(j, k, n) = \mathbb{I}(\hat{w}_j(\lambda_k, \mathcal{D}_n) \neq 0)$. Now define $P(j, k)$ to be the average of $S(j, k, n)$ over the 256
datasets. In Figure 13.10(a-b), we plot $P$ vs $-\log(\lambda)$ for lasso and bolasso. We see that for
bolasso, there is a large range of $\lambda$ where the true variables are selected, but this is not the
case for lasso. This is emphasized in Figure 13.10(c), where we plot the empirical probability that
the correct set of variables is recovered, for lasso and for bolasso with an increasing number of
bootstrap samples. Of course, using more samples takes longer. In practice, 32 bootstraps seems
to be a good compromise between speed and accuracy.

Figure 13.10 (a) Probability of selection of each variable (white = large probabilities, black = small
probabilities) vs. regularization parameter for Lasso. As we move from left to right, we decrease the amount of
regularization, and therefore select more variables. (b) Same as (a) but for bolasso. (c) Probability of correct
sign estimation vs. regularization parameter. Bolasso (red, dashed) and Lasso (black, plain): The number
of bootstrap replications is in {2, 4, 8, 16, 32, 64, 128, 256}. Based on Figures 1-3 of (Bach 2008). Figure
generated by bolassoDemo.

With bolasso, there is the usual issue of picking $\lambda$. Obviously we could use cross validation,
but plots such as Figure 13.10(b) suggest another heuristic: shuffle the rows to create a large
black block, and then pick $\lambda$ to be in the middle of this region. Of course, operationalizing this
intuition may be tricky, and will require various ad-hoc thresholds (it is reminiscent of the “find
the knee in the curve” heuristic discussed in Section 11.5.2 when discussing how to pick $K$ for
mixture models). A Bayesian approach provides a more principled method for selecting $\lambda$.


Bayesian inference for linear models with Laplace priors 


We have been focusing on MAP estimation in sparse linear models. It is also possible to perform 
Bayesian inference (see e.g., (Park and Casella 2008; Seeger 2008)). However, the posterior mean 
and median, as well as samples from the posterior, are not sparse; only the mode is sparse. This 
is another example of the phenomenon discussed in Section 5.2.1, where we said that the MAP 
estimate is often untypical of the bulk of the posterior. 

Another argument in favor of using the posterior mean comes from Equation 5.108, which 
showed that plugging in the posterior mean, rather than the posterior mode, is the optimal 
thing to do if we want to minimize squared prediction error. (Schniter et al. 2008) shows 
experimentally, and (Elad and Yavneh 2009) shows theoretically, that using the posterior mean 
with a spike-and-slab prior results in better prediction accuracy than using the posterior mode 
with a Laplace prior, albeit at slightly higher computational cost. 


$\ell_1$ regularization: algorithms 


In this section, we give a brief review of some algorithms that can be used to solve $\ell_1$ regularized 
estimation problems. We focus on the lasso case, where we have a quadratic loss. However, 
most of the algorithms can be extended to more general settings, such as logistic regression (see 
(Yaun et al. 2010) for a comprehensive review of $\ell_1$ regularized logistic regression). Note that this 
area of machine learning is advancing very rapidly, so the methods below may not be state of 
the art by the time you read this chapter. (See (Schmidt et al. 2009; Yaun et al. 2010; Yang et al. 
2010) for some recent surveys.) 


Coordinate descent 


Sometimes it is hard to optimize all the variables simultaneously, but it is easy to optimize them 
one by one. In particular, we can solve for the $j$'th coefficient with all the others held fixed: 

$$w_j^* = \arg\min_z f(\mathbf{w} + z \mathbf{e}_j) - f(\mathbf{w}) \tag{13.65}$$

where $\mathbf{e}_j$ is the $j$'th unit vector. We can either cycle through the coordinates in a deterministic 
fashion, or we can sample them at random, or we can choose to update the coordinate for 
which the gradient is steepest. 

The coordinate descent method is particularly appealing if each one-dimensional optimization 
problem can be solved analytically. For example, the shooting algorithm (Fu 1998; Wu and Lange 
2008) for lasso uses Equation 13.54 to compute the optimal value of $w_j$ given all the other 
coefficients. See Algorithm 13.1 for the pseudo code (and LassoShooting for some Matlab code). 

See (Yaun et al. 2010) for some extensions of this method to the logistic regression case. The 
resulting algorithm was the fastest method in their experimental comparison, which concerned 
document classification with large sparse feature vectors (representing bags of words). Other 
types of data (e.g., dense features and/or regression problems) might call for different algorithms. 


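Algorithm 13.1: Coordinate descent for lasso (aka shooting algorithm) 
Initialize $\mathbf{w} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$; 
repeat 
    for $j = 1, \ldots, D$ do 
        $a_j = 2 \sum_{i=1}^{N} x_{ij}^2$; 
        $c_j = 2 \sum_{i=1}^{N} x_{ij} (y_i - \mathbf{w}^T \mathbf{x}_i + w_j x_{ij})$; 
        $w_j = \text{soft}(c_j / a_j, \lambda / a_j)$; 
until converged; 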
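
The shooting updates can be sketched in a few lines of NumPy. This is an illustrative sketch, not the LassoShooting Matlab code: it assumes the objective is $\text{RSS}(\mathbf{w}) + \lambda ||\mathbf{w}||_1$, that $\lambda > 0$, and that no column of $\mathbf{X}$ is identically zero. The function names `soft` and `lasso_shooting`, the stopping tolerance, and the iteration cap are choices made here for illustration.

```python
import numpy as np

def soft(a, delta):
    """Soft-thresholding: sign(a) * max(|a| - delta, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def lasso_shooting(X, y, lam, max_iter=1000, tol=1e-8):
    """Minimize ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    N, D = X.shape
    # Ridge initialization, as in the first step of Algorithm 13.1.
    w = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
    a = 2.0 * np.sum(X ** 2, axis=0)            # a_j does not change across sweeps
    for _ in range(max_iter):
        w_old = w.copy()
        for j in range(D):
            r_j = y - X @ w + w[j] * X[:, j]    # residual with feature j removed
            c_j = 2.0 * X[:, j] @ r_j
            w[j] = soft(c_j / a[j], lam / a[j])
        if np.max(np.abs(w - w_old)) < tol:     # assumed convergence test
            break
    return w
```

With an orthonormal design such as `X = np.eye(3)`, each coordinate update reduces to soft-thresholding the corresponding response, so the loop stops after the second sweep.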


LARS and other homotopy methods 


The problem with coordinate descent is that it only updates one variable at a time, so can be 
slow to converge. Active set methods update many variables at a time. Unfortunately, they are 
more complicated, because of the need to identify which variables are constrained to be zero, 
and which are free to be updated. 

Active set methods typically only add or remove a few variables at a time, so they can take a 
long time if they are started far from the solution. But they are ideally suited for generating a set of 
solutions for different values of $\lambda$, starting with the empty set, i.e., for generating a regularization 
path. These algorithms exploit the fact that one can quickly compute $\hat{\mathbf{w}}(\lambda_k)$ from $\hat{\mathbf{w}}(\lambda_{k-1})$ 
if $\lambda_k \approx \lambda_{k-1}$; this is known as warm starting. In fact, even if we only want the solution for 
a single value of $\lambda$, call it $\lambda_*$, it can sometimes be computationally more efficient to compute 
a set of solutions, from $\lambda_{\max}$ down to $\lambda_*$, using warm-starting; this is called a continuation 
method or homotopy method. This is often much faster than directly “cold-starting” at $\lambda_*$; this 
is particularly true if $\lambda_*$ is small. 

Perhaps the most well-known example of a homotopy method in machine learning is the 
LARS algorithm, which stands for “least angle regression and shrinkage” (Efron et al. 2004) (a 
similar algorithm was independently invented in (Osborne et al. 2000b,a)). This can compute 
$\hat{\mathbf{w}}(\lambda)$ for all possible values of $\lambda$ in an efficient manner. 

LARS works as follows. It starts with a large value of $\lambda$, such that only the variable that is most 
correlated with the response vector $\mathbf{y}$ is chosen. Then $\lambda$ is decreased until a second variable 
is found which has the same correlation (in terms of magnitude) with the current residual as 
the first variable, where the residual at step $k$ is defined as $\mathbf{r}_k = \mathbf{y} - \mathbf{X}_{:,F_k} \mathbf{w}_k$, where $F_k$ is 
the current active set (cf. Equation 13.50). Remarkably, one can solve for this new value of 
$\lambda$ analytically, by using a geometric argument (hence the term “least angle”). This allows the 
algorithm to quickly “jump” to the next point on the regularization path where the active set 
changes. This repeats until all the variables are added. 

It is necessary to allow variables to be removed from the active set if we want the sequence of 
solutions to correspond to the regularization path of lasso. If we disallow variable removal, we 
get a slightly different algorithm called LAR, which tends to be faster. In particular, LAR costs 
the same as a single ordinary least squares fit, namely $O(ND \min(N, D))$, which is $O(ND^2)$ 
if $N > D$, and $O(N^2 D)$ if $D > N$. LAR is very similar to greedy forward selection, and a 
method known as least squares boosting (see Section 16.4.6). 

There have been many attempts to extend the LARS algorithm to compute the full regularization 
path for $\ell_1$ regularized GLMs, such as logistic regression. In general, one cannot analytically 
solve for the critical values of $\lambda$. Instead, the standard approach is to start at $\lambda_{\max}$, and then 
slowly decrease $\lambda$, tracking the solution as we go; this is called a continuation method or 
homotopy method. These methods exploit the fact that we can quickly compute $\hat{\mathbf{w}}(\lambda_k)$ from 
$\hat{\mathbf{w}}(\lambda_{k-1})$ if $\lambda_k \approx \lambda_{k-1}$; this is known as warm starting. Even if we don’t want the full path, 
this method is often much faster than directly “cold-starting” at the desired value of $\lambda$ (this is 
particularly true if $\lambda$ is small). 

The method described in (Friedman et al. 2010) combines coordinate descent with this 
warm-starting strategy, and computes the full regularization path for any $\ell_1$ regularized GLM. This has 
been implemented in the glmnet package, which is bundled with PMTK. 
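
The combination of coordinate descent and warm starting can be sketched as follows for the lasso case. This is a minimal illustration under stated assumptions, not the glmnet implementation: it uses the objective $||\mathbf{y} - \mathbf{X}\mathbf{w}||_2^2 + \lambda ||\mathbf{w}||_1$, for which $\lambda_{\max} = 2 ||\mathbf{X}^T \mathbf{y}||_\infty$ is the smallest value giving $\hat{\mathbf{w}} = \mathbf{0}$, and a log-spaced grid whose length and end point (`n_lambdas`, `ratio`) are arbitrary choices. The names `lasso_cd` and `lasso_path` are hypothetical.

```python
import numpy as np

def soft(a, delta):
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def lasso_cd(X, y, lam, w_init, max_iter=1000, tol=1e-10):
    """Cyclic coordinate descent for ||y - Xw||^2 + lam*||w||_1, started at w_init."""
    w = w_init.astype(float).copy()
    a = 2.0 * np.sum(X ** 2, axis=0)
    for sweep in range(1, max_iter + 1):
        w_old = w.copy()
        for j in range(X.shape[1]):
            r_j = y - X @ w + w[j] * X[:, j]
            w[j] = soft(2.0 * X[:, j] @ r_j / a[j], lam / a[j])
        if np.max(np.abs(w - w_old)) < tol:
            break
    return w, sweep

def lasso_path(X, y, n_lambdas=20, ratio=1e-3):
    """Solve on a decreasing grid of lambdas, warm-starting each fit at the previous solution."""
    lam_max = 2.0 * np.max(np.abs(X.T @ y))   # w = 0 is optimal for lam >= lam_max
    lams = lam_max * np.logspace(0.0, np.log10(ratio), n_lambdas)
    w = np.zeros(X.shape[1])
    path = []
    for lam in lams:
        w, _ = lasso_cd(X, y, lam, w_init=w)  # warm start
        path.append(w.copy())
    return lams, np.array(path)
```

The second return value of `lasso_cd` is the number of sweeps used, which makes it easy to compare a warm-started fit against a cold start from zero at the same $\lambda$.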


Proximal and gradient projection methods 


In this section, we consider some methods that are suitable for very large scale problems, where 
homotopy methods may be too slow. These methods will also be easy to extend to other kinds 
of regularizers, beyond $\ell_1$, as we will see later. Our presentation in this section is based on 
(Vandenberghe 2011; Yang et al. 2010). 

Consider a convex objective of the form 

$$f(\boldsymbol{\theta}) = L(\boldsymbol{\theta}) + R(\boldsymbol{\theta}) \tag{13.66}$$

where $L(\boldsymbol{\theta})$ (representing the loss) is convex and differentiable, and $R(\boldsymbol{\theta})$ (representing the 
regularizer) is convex but not necessarily differentiable. For example, $L(\boldsymbol{\theta}) = \text{RSS}(\boldsymbol{\theta})$ and 
$R(\boldsymbol{\theta}) = \lambda ||\boldsymbol{\theta}||_1$ corresponds to the BPDN problem. As another example, the lasso problem can 
be formulated as follows: $L(\boldsymbol{\theta}) = \text{RSS}(\boldsymbol{\theta})$ and $R(\boldsymbol{\theta}) = I_C(\boldsymbol{\theta})$, where $C = \{\boldsymbol{\theta} : ||\boldsymbol{\theta}||_1 \leq B\}$ 
and $I_C(\boldsymbol{\theta})$ is the indicator function of a convex set $C$, defined as 

$$I_C(\boldsymbol{\theta}) \triangleq \begin{cases} 0 & \boldsymbol{\theta} \in C \\ +\infty & \text{otherwise} \end{cases} \tag{13.67}$$


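In some cases, it is easy to optimize functions of the form in Equation 13.66. For example, 
suppose $L(\boldsymbol{\theta}) = \text{RSS}(\boldsymbol{\theta})$, and the design matrix is simply $\mathbf{X} = \mathbf{I}$. Then the objective becomes 
$f(\boldsymbol{\theta}) = R(\boldsymbol{\theta}) + \frac{1}{2} ||\boldsymbol{\theta} - \mathbf{y}||_2^2$. The minimizer of this is given by $\text{prox}_R(\mathbf{y})$, which is the proximal 
operator for the convex function $R$, defined by 

$$\text{prox}_R(\mathbf{y}) = \arg\min_{\mathbf{z}} \left( R(\mathbf{z}) + \frac{1}{2} ||\mathbf{z} - \mathbf{y}||_2^2 \right) \tag{13.68}$$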


Intuitively, we are returning a point that minimizes R but which is also close (proximal) to y. 
In general, we will use this operator inside an iterative optimizer, in which case we want to stay 
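close to the previous iterate. In this case, we use 

$$\text{prox}_R(\boldsymbol{\theta}_k) = \arg\min_{\mathbf{z}} \left( R(\mathbf{z}) + \frac{1}{2} ||\mathbf{z} - \boldsymbol{\theta}_k||_2^2 \right) \tag{13.69}$$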


The key issues are: how do we efficiently compute the proximal operator for different regu- 
larizers R, and how do we extend this technique to more general loss functions L? We discuss 
these issues below. 


Proximal operators 
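If $R(\boldsymbol{\theta}) = \lambda ||\boldsymbol{\theta}||_1$, the proximal operator is given by componentwise soft-thresholding: 

$$\text{prox}_R(\boldsymbol{\theta}) = \text{soft}(\boldsymbol{\theta}, \lambda) \tag{13.70}$$

as we showed in Section 13.3.2. If $R(\boldsymbol{\theta}) = \lambda ||\boldsymbol{\theta}||_0$, the proximal operator is given by 
componentwise hard-thresholding: 

$$\text{prox}_R(\boldsymbol{\theta}) = \text{hard}(\boldsymbol{\theta}, \sqrt{2\lambda}) \tag{13.71}$$

where $\text{hard}(u, a) \triangleq u \, \mathbb{I}(|u| > a)$. 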
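
The two thresholding operators of Equations 13.70 and 13.71 are simple enough to state directly in code. The following NumPy sketch is for illustration only; the function names are chosen here and are not from the text.

```python
import numpy as np

def soft(theta, lam):
    """Prox of lam*||.||_1: shrink each component toward zero by lam."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def hard(theta, a):
    """Keep components with |theta_j| > a and zero out the rest."""
    return theta * (np.abs(theta) > a)

def prox_l0(theta, lam):
    """Prox of lam*||.||_0: hard-thresholding at sqrt(2*lam), as in Equation 13.71."""
    return hard(theta, np.sqrt(2.0 * lam))

theta = np.array([3.0, -0.5, 1.5, -2.0])
print(soft(theta, 1.0))      # every surviving component moves toward zero by 1
print(prox_l0(theta, 1.0))   # surviving components are left unchanged
```

The contrast between the two outputs shows why soft-thresholding biases large coefficients toward zero while hard-thresholding does not.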

If $R(\boldsymbol{\theta}) = I_C(\boldsymbol{\theta})$, the proximal operator is given by the projection onto the set $C$: 

$$\text{prox}_R(\boldsymbol{\theta}) = \arg\min_{\mathbf{z} \in C} ||\mathbf{z} - \boldsymbol{\theta}||_2^2 = \text{proj}_C(\boldsymbol{\theta}) \tag{13.72}$$


Figure 13.11 Illustration of projected gradient descent. The step along the negative gradient, to $\boldsymbol{\theta}_k - \mathbf{g}_k$, 
takes us outside the feasible set. If we project that point onto the closest point in the set we get 
$\boldsymbol{\theta}_{k+1} = \text{proj}_C(\boldsymbol{\theta}_k - \mathbf{g}_k)$. We can then derive the implicit update direction using $\mathbf{d}_k = \boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}_k$. Used 
with kind permission of Mark Schmidt. 


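For some convex sets, it is easy to compute the projection operator. For example, to project 
onto the rectangular set defined by the box constraints $C = \{\boldsymbol{\theta} : \ell_j \leq \theta_j \leq u_j\}$ we can use 

$$\text{proj}_C(\boldsymbol{\theta})_j = \begin{cases} \ell_j & \theta_j \leq \ell_j \\ \theta_j & \ell_j \leq \theta_j \leq u_j \\ u_j & \theta_j \geq u_j \end{cases} \tag{13.73}$$

To project onto the Euclidean ball $C = \{\boldsymbol{\theta} : ||\boldsymbol{\theta}||_2 \leq 1\}$ we can use 

$$\text{proj}_C(\boldsymbol{\theta}) = \begin{cases} \boldsymbol{\theta} / ||\boldsymbol{\theta}||_2 & ||\boldsymbol{\theta}||_2 > 1 \\ \boldsymbol{\theta} & ||\boldsymbol{\theta}||_2 \leq 1 \end{cases} \tag{13.74}$$

To project onto the 1-norm ball $C = \{\boldsymbol{\theta} : ||\boldsymbol{\theta}||_1 \leq 1\}$ we can use 

$$\text{proj}_C(\boldsymbol{\theta}) = \text{soft}(\boldsymbol{\theta}, \lambda) \tag{13.75}$$

where $\lambda = 0$ if $||\boldsymbol{\theta}||_1 \leq 1$, and otherwise $\lambda$ is the solution to the equation 

$$\sum_{j=1}^{D} \max(|\theta_j| - \lambda, 0) = 1 \tag{13.76}$$
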
We can implement the whole procedure in O(D) time, as explained in (Duchi et al. 2008). 

We will see an application of these different projection methods in Section 13.5.1.2. 
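
The three projections above can be sketched as follows. The box and Euclidean-ball cases follow Equations 13.73 and 13.74 directly. For the 1-norm ball, this sketch finds the threshold $\lambda$ of Equation 13.76 by sorting, which costs $O(D \log D)$; it is a simpler stand-in for, not a reproduction of, the $O(D)$ procedure cited from (Duchi et al. 2008). Function names are illustrative.

```python
import numpy as np

def project_box(theta, lower, upper):
    """Projection onto {theta : lower <= theta_j <= upper} (Equation 13.73)."""
    return np.clip(theta, lower, upper)

def project_l2_ball(theta):
    """Projection onto the unit Euclidean ball (Equation 13.74)."""
    norm = np.linalg.norm(theta)
    return theta / norm if norm > 1.0 else theta.copy()

def project_l1_ball(theta):
    """Projection onto the unit 1-norm ball via soft-thresholding (Equations 13.75-13.76)."""
    if np.sum(np.abs(theta)) <= 1.0:
        return theta.copy()                       # already feasible, lambda = 0
    u = np.sort(np.abs(theta))[::-1]              # magnitudes in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    rho = np.max(ks[u - (css - 1.0) / ks > 0])    # number of components that stay nonzero
    lam = (css[rho - 1] - 1.0) / rho              # threshold solving Equation 13.76
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)
```

For example, projecting `[0.8, 0.6, -0.2]` onto the 1-norm ball uses the threshold $\lambda = 0.2$, since $0.6 + 0.4 + 0 = 1$.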
Proximal gradient method 


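We now discuss how to use the proximal operator inside of a gradient descent routine. The 
basic idea is to minimize a simple quadratic approximation to the loss function, centered on 
$\boldsymbol{\theta}_k$: 

$$\boldsymbol{\theta}_{k+1} = \arg\min_{\mathbf{z}} R(\mathbf{z}) + L(\boldsymbol{\theta}_k) + \mathbf{g}_k^T (\mathbf{z} - \boldsymbol{\theta}_k) + \frac{1}{2 t_k} ||\mathbf{z} - \boldsymbol{\theta}_k||_2^2 \tag{13.77}$$

where $\mathbf{g}_k = \nabla L(\boldsymbol{\theta}_k)$ is the gradient of the loss, $t_k$ is a constant discussed below, and the last 
term arises from a simple approximation to the Hessian of the loss of the form $\nabla^2 L(\boldsymbol{\theta}_k) \approx \frac{1}{t_k} \mathbf{I}$. 
Dropping terms that are independent of $\mathbf{z}$, and multiplying by $t_k$, we can rewrite the above 
expression in terms of a proximal operator as follows: 

$$\boldsymbol{\theta}_{k+1} = \arg\min_{\mathbf{z}} \left[ t_k R(\mathbf{z}) + \frac{1}{2} ||\mathbf{z} - \mathbf{u}_k||_2^2 \right] = \text{prox}_{t_k R}(\mathbf{u}_k) \tag{13.78}$$
$$\mathbf{u}_k = \boldsymbol{\theta}_k - t_k \mathbf{g}_k \tag{13.79}$$
$$\mathbf{g}_k = \nabla L(\boldsymbol{\theta}_k) \tag{13.80}$$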


If $R(\boldsymbol{\theta}) = 0$, this is equivalent to gradient descent. If $R(\boldsymbol{\theta}) = I_C(\boldsymbol{\theta})$, the method is equivalent 
to projected gradient descent, sketched in Figure 13.11. If $R(\boldsymbol{\theta}) = \lambda ||\boldsymbol{\theta}||_1$, the method is 
known as iterative soft thresholding. 

There are several ways to pick $t_k$, or equivalently, $\alpha_k = 1/t_k$. Given that $\alpha_k \mathbf{I}$ is an 
approximation to the Hessian $\nabla^2 L$, we require that 

$$\alpha_k (\boldsymbol{\theta}_k - \boldsymbol{\theta}_{k-1}) \approx \mathbf{g}_k - \mathbf{g}_{k-1} \tag{13.81}$$

in the least squares sense. Hence 

$$\alpha_k = \arg\min_{\alpha} ||\alpha (\boldsymbol{\theta}_k - \boldsymbol{\theta}_{k-1}) - (\mathbf{g}_k - \mathbf{g}_{k-1})||_2^2 = \frac{(\boldsymbol{\theta}_k - \boldsymbol{\theta}_{k-1})^T (\mathbf{g}_k - \mathbf{g}_{k-1})}{(\boldsymbol{\theta}_k - \boldsymbol{\theta}_{k-1})^T (\boldsymbol{\theta}_k - \boldsymbol{\theta}_{k-1})} \tag{13.82}$$

This is known as the Barzilai-Borwein (BB) or spectral stepsize (Barzilai and Borwein 1988; 
Fletcher 2005; Raydan 1997). This stepsize can be used with any gradient method, whether 
proximal or not. It does not lead to monotonic decrease of the objective, but it is much faster 
than standard line search techniques. (To ensure convergence, we require that the objective 
decrease “on average”, where the average is computed over a sliding window of size $M + 1$.) 

When we combine the BB stepsize with the iterative soft thresholding technique (for $R(\boldsymbol{\theta}) = 
\lambda ||\boldsymbol{\theta}||_1$), plus a continuation method that gradually reduces $\lambda$, we get a fast method for the 
BPDN problem known as the SpaRSA algorithm, which stands for “sparse reconstruction by 
separable approximation” (Wright et al. 2009). However, we will call it the iterative shrinkage and 
thresholding algorithm. See Algorithm 13.2 for some pseudocode, and SpaRSA for some Matlab 
code. See also Exercise 13.11 for a related approach based on projected gradient descent. 


Algorithm 13.2: Iterative Shrinkage-Thresholding Algorithm (ISTA) 
Input: $\mathbf{X} \in \mathbb{R}^{N \times D}$, $\mathbf{y} \in \mathbb{R}^N$, parameters $\lambda \geq 0$, $M \geq 1$, $0 < s < 1$; 
Initialize $\boldsymbol{\theta}_0 = \mathbf{0}$, $\alpha = 1$, $\mathbf{r} = \mathbf{y}$, $\lambda_0 = \infty$; 
repeat 
    $\lambda_t = \max(s ||\mathbf{X}^T \mathbf{r}||_\infty, \lambda)$ // Adapt the regularizer; 
    repeat 
        $\mathbf{g} = \nabla L(\boldsymbol{\theta})$; 
        $\mathbf{u} = \boldsymbol{\theta} - \frac{1}{\alpha} \mathbf{g}$; 
        $\boldsymbol{\theta} = \text{soft}(\mathbf{u}, \frac{\lambda_t}{\alpha})$; 
        Update $\alpha$ using BB stepsize in Equation 13.82; 
    until $f(\boldsymbol{\theta})$ increased too much within the past $M$ steps; 
    $\mathbf{r} = \mathbf{y} - \mathbf{X} \boldsymbol{\theta}$ // Update residual; 
until $\lambda_t = \lambda$; 


Figure 13.12 Representing lasso using a Gaussian scale mixture prior. 


Nesterov’s method 


A faster version of proximal gradient descent can be obtained by expanding the quadratic 
approximation around a point other than the most recent parameter value. In particular, consider 
performing updates of the form 

$$\boldsymbol{\theta}_{k+1} = \text{prox}_{t_k R}(\boldsymbol{\phi}_k - t_k \mathbf{g}_k) \tag{13.83}$$
$$\mathbf{g}_k = \nabla L(\boldsymbol{\phi}_k) \tag{13.84}$$
$$\boldsymbol{\phi}_k = \boldsymbol{\theta}_k + \frac{k-1}{k+2} (\boldsymbol{\theta}_k - \boldsymbol{\theta}_{k-1}) \tag{13.85}$$

This is known as Nesterov’s method (Nesterov 2004; Tseng 2008). As before, there are a variety 
of ways of setting $t_k$; typically one uses line search. 
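
The proximal gradient update (Equations 13.78-13.80) and its Nesterov variant (Equations 13.83-13.85) differ only in the point at which the gradient is evaluated, which the following sketch makes explicit for the $\ell_1$ regularizer. It is a simplified illustration, not SpaRSA or the published FISTA code: it assumes the loss $L(\boldsymbol{\theta}) = \frac{1}{2} ||\mathbf{X}\boldsymbol{\theta} - \mathbf{y}||_2^2$, and it swaps in a fixed stepsize $t = 1/\lambda_{\max}(\mathbf{X}^T \mathbf{X})$ in place of the BB stepsize or a line search. It also omits the continuation over $\lambda$. The function name and arguments are hypothetical.

```python
import numpy as np

def soft(u, delta):
    return np.sign(u) * np.maximum(np.abs(u) - delta, 0.0)

def prox_grad_lasso(X, y, lam, n_iter=500, accelerate=False):
    """Minimize 0.5*||X theta - y||^2 + lam*||theta||_1.

    accelerate=False: iterative soft thresholding (Equations 13.78-13.80).
    accelerate=True:  Nesterov extrapolation (Equations 13.83-13.85).
    """
    t = 1.0 / np.linalg.norm(X, 2) ** 2        # fixed step: 1 / largest eigenvalue of X^T X
    theta = np.zeros(X.shape[1])
    theta_prev = theta.copy()
    for k in range(n_iter):
        if accelerate:
            phi = theta + (k - 1.0) / (k + 2.0) * (theta - theta_prev)
        else:
            phi = theta
        g = X.T @ (X @ phi - y)                # gradient of the loss at phi
        theta_prev = theta
        theta = soft(phi - t * g, t * lam)     # prox of t * lam * ||.||_1
    return theta
```

Because the fixed stepsize is a conservative choice, this version decreases the objective monotonically when `accelerate=False`, at the price of slower progress than the BB stepsize described above.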

When Nesterov’s method is combined with the iterative soft thresholding technique (for $R(\boldsymbol{\theta}) = 
\lambda ||\boldsymbol{\theta}||_1$), plus a continuation method that gradually reduces $\lambda$, we get a fast method for the 
BPDN problem known as the fast iterative shrinkage thresholding algorithm or FISTA (Beck 
and Teboulle 2009). 


EM for lasso 


In this section, we show how to solve the lasso problem using EM. At first sight, this might 
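seem odd, since there are no hidden variables. The key insight is that we can represent the 
Laplace distribution as a Gaussian scale mixture (GSM) (Andrews and Mallows 1974; West 1987) 
as follows: 

$$\text{Lap}(w_j | 0, 1/\gamma) = \frac{\gamma}{2} e^{-\gamma |w_j|} = \int \mathcal{N}(w_j | 0, \tau_j^2) \, \text{Ga}(\tau_j^2 | 1, \tfrac{\gamma^2}{2}) \, d\tau_j^2 \tag{13.86}$$

Thus the Laplace is a GSM where the mixing distribution on the variances is the exponential 
distribution, $\text{Expon}(\tau_j^2 | \frac{\gamma^2}{2}) = \text{Ga}(\tau_j^2 | 1, \frac{\gamma^2}{2})$. Using this decomposition, we can represent the 
lasso model as shown in Figure 13.12. The corresponding joint distribution has the form 

$$p(\mathbf{y}, \mathbf{w}, \boldsymbol{\tau}, \sigma^2 | \mathbf{X}) = \mathcal{N}(\mathbf{y} | \mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}_N) \, \mathcal{N}(\mathbf{w} | \mathbf{0}, \mathbf{D}_\tau) \, \text{IG}(\sigma^2 | a_\sigma, b_\sigma) \left[ \prod_j \text{Ga}(\tau_j^2 | 1, \gamma^2/2) \right] \tag{13.87}$$

where $\mathbf{D}_\tau = \text{diag}(\tau_j^2)$, and where we have assumed for notational simplicity that $\mathbf{X}$ is 
standardized and that $\mathbf{y}$ is centered (so we can ignore the offset term $\mu$). Expanding out, we 
get 

$$p(\mathbf{y}, \mathbf{w}, \boldsymbol{\tau}, \sigma^2 | \mathbf{X}) \propto (\sigma^2)^{-N/2} \exp\left( -\frac{1}{2\sigma^2} ||\mathbf{y} - \mathbf{X}\mathbf{w}||_2^2 \right) |\mathbf{D}_\tau|^{-\frac{1}{2}} \exp\left( -\frac{1}{2} \mathbf{w}^T \mathbf{D}_\tau^{-1} \mathbf{w} \right) (\sigma^2)^{-(a_\sigma + 1)} \exp(-b_\sigma / \sigma^2) \prod_j \exp\left( -\frac{\gamma^2}{2} \tau_j^2 \right) \tag{13.88}$$

Below we describe how to apply the EM algorithm to the model in Figure 13.12 (see footnote 5). In brief, in 
the E step we infer $\tau_j^2$ and $\sigma^2$, and in the M step we estimate $\mathbf{w}$. The resulting estimate $\hat{\mathbf{w}}$ is 
the same as the lasso estimator. This approach was first proposed in (Figueiredo 2003) (see also 
(Griffin and Brown 2007; Caron and Doucet 2008; Ding and Harrison 2010) for some extensions). 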


Why EM? 


Before going into the details of EM, it is worthwhile asking why we are presenting this approach 
at all, given that there are a variety of other (often much faster) algorithms that directly solve the 
$\ell_1$ MAP estimation problem (see linregFitL1Test for an empirical comparison). The reason 
is that the latent variable perspective brings several advantages, such as the following: 


• It provides an easy way to derive an algorithm to find $\ell_1$-regularized parameter estimates for 
a variety of other models, such as robust linear regression (Exercise 11.12) or probit regression 
(Exercise 13.9). 

• It suggests trying other priors on the variances besides $\text{Ga}(\tau_j^2 | 1, \gamma^2/2)$. We will consider 
various extensions below. 

• It makes it clear how we can compute the full posterior, $p(\mathbf{w} | \mathcal{D})$, rather than just a MAP 
estimate. This technique is known as the Bayesian lasso (Park and Casella 2008; Hans 2009). 


5. To ensure the posterior is unimodal, one can follow (Park and Casella 2008) and slightly modify the model by 
making the prior variance for the weights depend on the observation noise: $p(w_j | \tau_j^2, \sigma^2) = \mathcal{N}(w_j | 0, \sigma^2 \tau_j^2)$. The EM 
algorithm is easy to modify. 


The objective function 


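From Equation 13.88, the complete data penalized log likelihood is as follows (dropping terms 
that do not depend on $\mathbf{w}$) 

$$\ell_c(\mathbf{w}) = -\frac{1}{2\sigma^2} ||\mathbf{y} - \mathbf{X}\mathbf{w}||_2^2 - \frac{1}{2} \mathbf{w}^T \boldsymbol{\Lambda} \mathbf{w} + \text{const} \tag{13.89}$$

where $\boldsymbol{\Lambda} = \text{diag}(\frac{1}{\tau_j^2})$ is the precision matrix for $\mathbf{w}$. 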


The E step 


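The key is to compute $\mathbb{E}\left[ \frac{1}{\tau_j^2} \mid w_j \right]$. We can derive this directly (see Exercise 13.8). Alternatively, 
we can derive the full posterior, which is given by the following (Park and Casella 2008): 

$$p(1/\tau_j^2 | \mathbf{w}, \mathcal{D}) = \text{InverseGaussian}\left( \sqrt{\frac{\gamma^2}{w_j^2}}, \gamma^2 \right) \tag{13.90}$$

(Note that the inverse Gaussian distribution is also known as the Wald distribution.) Hence 

$$\mathbb{E}\left[ \frac{1}{\tau_j^2} \mid w_j \right] = \frac{\gamma}{|w_j|} \tag{13.91}$$

Let $\overline{\boldsymbol{\Lambda}} = \text{diag}(\mathbb{E}[1/\tau_1^2], \ldots, \mathbb{E}[1/\tau_D^2])$ denote the result of this E step. 

We also need to infer $\sigma^2$. It is easy to show that the posterior is 

$$p(\sigma^2 | \mathcal{D}, \mathbf{w}) = \text{IG}\left( a_\sigma + N/2, \; b_\sigma + \frac{1}{2} (\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w}) \right) = \text{IG}(a_N, b_N) \tag{13.92}$$

Hence 

$$\mathbb{E}[1/\sigma^2] = \frac{a_N}{b_N} \triangleq \overline{\omega} \tag{13.93}$$


The M step 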


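The M step consists of computing 

$$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} -\frac{1}{2} \overline{\omega} ||\mathbf{y} - \mathbf{X}\mathbf{w}||_2^2 - \frac{1}{2} \mathbf{w}^T \overline{\boldsymbol{\Lambda}} \mathbf{w} \tag{13.94}$$

This is just MAP estimation under a Gaussian prior: 

$$\hat{\mathbf{w}} = (\sigma^2 \overline{\boldsymbol{\Lambda}} + \mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \tag{13.95}$$

However, since we expect many $w_j = 0$, we will have $\tau_j^2 = 0$ for many $j$, making inverting $\overline{\boldsymbol{\Lambda}}$ 
numerically unstable. Fortunately, we can use the SVD of $\mathbf{X}$, given by $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$, as follows: 

$$\hat{\mathbf{w}} = \boldsymbol{\Psi} \mathbf{V} \left( \mathbf{V}^T \boldsymbol{\Psi} \mathbf{V} + \frac{1}{\overline{\omega}} \mathbf{D}^{-2} \right)^{-1} \mathbf{D}^{-1} \mathbf{U}^T \mathbf{y} \tag{13.96}$$

where $\overline{\omega} = \mathbb{E}[1/\sigma^2]$ is the expected noise precision and 

$$\boldsymbol{\Psi} = \overline{\boldsymbol{\Lambda}}^{-1} = \text{diag}\left( \frac{1}{\mathbb{E}[1/\tau_j^2]} \right) = \text{diag}\left( \frac{|w_j|}{\gamma} \right) \tag{13.97}$$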

Caveat 


Since the lasso objective is convex, this method should always find the global optimum. 
Unfortunately, this sometimes does not happen, for numerical reasons. In particular, suppose that in 
the true solution, $w_j^* \neq 0$. Further, suppose that we set $\hat{w}_j = 0$ in an M step. In the following E 
step we infer that $\tau_j^2 = 0$, so then we set $\hat{w}_j = 0$ again; thus we can never “undo” our mistake. 
Fortunately, in practice, this situation seems to be rare. See (Hunter and Li 2005) for further 
discussion. 


$\ell_1$ regularization: extensions 


In this section, we discuss various extensions of “vanilla” $\ell_1$ regularization. 


Group Lasso 


In standard $\ell_1$ regularization, we assume that there is a 1:1 correspondence between parameters 
and variables, so that if $\hat{w}_j = 0$, we interpret this to mean that variable $j$ is excluded. But 
in more complex models, there may be many parameters associated with a given variable. In 
particular, we may have a vector of weights for each input, $\mathbf{w}_j$. Here are some examples: 


• Multinomial logistic regression Each feature is associated with $C$ different weights, one 
per class. 

• Linear regression with categorical inputs Each scalar input is one-hot encoded into a 
vector of length $C$. 

• Multi-task learning In multi-task learning, we have multiple related prediction problems. 
For example, we might have $C$ separate regression or binary classification problems. Thus 
each feature is associated with $C$ different weights. We may want to use a feature for all of 
the tasks or none of the tasks, and thus select weights at the group level (Obozinski et al. 
2007). 


If we use an $\ell_1$ regularizer of the form $||\mathbf{w}|| = \sum_j \sum_c |w_{jc}|$, we may end up with some 
elements of $\mathbf{w}_{j,:}$ being zero and some not. To prevent this kind of situation, we partition the 
parameter vector into $G$ groups. We now minimize the following objective 

$$J(\mathbf{w}) = \text{NLL}(\mathbf{w}) + \sum_{g=1}^{G} \lambda_g ||\mathbf{w}_g||_2 \tag{13.98}$$

where 

$$||\mathbf{w}_g||_2 = \sqrt{\sum_{j \in g} w_j^2} \tag{13.99}$$

is the 2-norm of the group weight vector. If the NLL is least squares, this method is called 
group lasso (Yuan and Lin 2006). 

We often use a larger penalty for larger groups, by setting $\lambda_g = \lambda \sqrt{d_g}$, where $d_g$ is the 
number of elements in group $g$. For example, if we have groups $\{1, 2\}$ and $\{3, 4, 5\}$, the 
objective becomes 

$$J(\mathbf{w}) = \text{NLL}(\mathbf{w}) + \lambda \left[ \sqrt{2} \sqrt{w_1^2 + w_2^2} + \sqrt{3} \sqrt{w_3^2 + w_4^2 + w_5^2} \right] \tag{13.100}$$

Note that if we had used the square of the 2-norms, the model would become equivalent to 
ridge regression, since 

$$\sum_{g=1}^{G} ||\mathbf{w}_g||_2^2 = \sum_g \sum_{j \in g} w_j^2 = ||\mathbf{w}||_2^2 \tag{13.101}$$

By using the square root, we are penalizing the radius of a ball containing the group’s weight 
vector: the only way for the radius to be small is if all elements are small. Thus the square root 
results in group sparsity. 

A variant of this technique replaces the 2-norm with the infinity-norm (Turlach et al. 2005; 
Zhao et al. 2005): 

$$||\mathbf{w}_g||_\infty = \max_{j \in g} |w_j| \tag{13.102}$$

It is clear that this will also result in group sparsity. 

An illustration of the difference is shown in Figures 13.13 and 13.14. In both cases, we have a 
true signal $\mathbf{w}$ of size $D = 2^{12} = 4096$, divided into 64 groups each of size 64. We randomly 
choose 8 groups of $\mathbf{w}$ and assign them non-zero values. In the first example, the values are 
drawn from a $\mathcal{N}(0, 1)$. In the second example, the values are all set to 1. We then pick a random 
design matrix $\mathbf{X}$ of size $N \times D$, where $N = 2^{10} = 1024$. Finally, we generate $\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}$, 
where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, 10^{-4} \mathbf{I}_N)$. Given this data, we estimate the support of $\mathbf{w}$ using $\ell_1$ or group $\ell_1$, 
and then estimate the non-zero values using least squares. We see that group lasso does a much 
better job than vanilla lasso, since it respects the known group structure. We also see that the 
$\ell_\infty$ norm has a tendency to make all the elements within a block have similar magnitude. 
This is appropriate in the second example, but not the first. (The value of $\lambda$ was the same in all 
examples, and was chosen by hand.) 


GSM interpretation of group lasso 
Group lasso is equivalent to MAP estimation using the following prior 

$$p(\mathbf{w} | \gamma, \sigma^2) \propto \exp\left( -\frac{\gamma}{\sigma} \sum_{g=1}^{G} ||\mathbf{w}_g||_2 \right) \tag{13.103}$$


6. The slight non-zero “noise” in the $\ell_\infty$ group lasso results is presumably due to numerical errors. 


Figure 13.13 Illustration of group lasso where the original signal is piecewise Gaussian. Top left: original 
signal. Bottom left: vanilla lasso estimate. Top right: group lasso estimate using an $\ell_2$ norm on the blocks. 
Bottom right: group lasso estimate using an $\ell_\infty$ norm on the blocks. Based on Figures 3-4 of (Wright et al. 
2009). Figure generated by groupLassoDemo, based on code by Mario Figueiredo. 


Now one can show (Exercise 13.10) that this prior can be written as a GSM, as follows: 

$$\mathbf{w}_g | \sigma^2, \tau_g^2 \sim \mathcal{N}(\mathbf{0}, \sigma^2 \tau_g^2 \mathbf{I}_{d_g}) \tag{13.104}$$
$$\tau_g^2 | \gamma \sim \text{Ga}\left( \frac{d_g + 1}{2}, \frac{\gamma^2}{2} \right) \tag{13.105}$$

where $d_g$ is the size of group $g$. So we see that there is one variance term per group, each 
of which comes from a Gamma prior, whose shape parameter depends on the group size, and 
whose rate parameter is controlled by $\gamma$. Figure 13.15 gives an example, where we have 2 groups, 
one of size 2 and one of size 3. 

This picture also makes it clearer why there should be a grouping effect. Suppose $w_{1,1}$ is 
small; then $\tau_1^2$ will be estimated to be small, which will force $w_{1,2}$ to be small. Conversely, 
suppose $w_{1,1}$ is large; then $\tau_1^2$ will be estimated to be large, which will allow $w_{1,2}$ to become 
large as well. 


Figure 13.14 Same as Figure 13.13, except the original signal is piecewise constant. 


Figure 13.15 Graphical model for group lasso with 2 groups, the first has size $G_1 = 2$, the second has 
size $G_2 = 3$. 


Algorithms for group lasso 


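There are a variety of algorithms for group lasso. Here we briefly mention two. The first 
approach is based on proximal gradient descent, discussed in Section 13.4.3. Since the regularizer 
is separable, $R(\mathbf{w}) = \sum_g ||\mathbf{w}_g||_p$, the proximal operator decomposes into $G$ separate operators 
of the form 

$$\text{prox}_R(\mathbf{b}) = \arg\min_{\mathbf{z} \in \mathbb{R}^{d_g}} \frac{1}{2} ||\mathbf{z} - \mathbf{b}||_2^2 + \lambda ||\mathbf{z}||_p \tag{13.106}$$

where $\mathbf{b} = \boldsymbol{\theta}_{kg} - t_k \mathbf{g}_{kg}$, and where we include the factor of $\frac{1}{2}$ to match the definition in Equation 13.68. 
If $p = 2$, one can show (Combettes and Wajs 2005) that this can be 
implemented as follows 

$$\text{prox}_R(\mathbf{b}) = \mathbf{b} - \text{proj}_{\lambda C}(\mathbf{b}) \tag{13.107}$$

where $C = \{\mathbf{z} : ||\mathbf{z}||_2 \leq 1\}$ is the $\ell_2$ ball. Using Equation 13.74, if $||\mathbf{b}||_2 < \lambda$, we have 

$$\text{prox}_R(\mathbf{b}) = \mathbf{b} - \mathbf{b} = \mathbf{0} \tag{13.108}$$

otherwise we have 

$$\text{prox}_R(\mathbf{b}) = \mathbf{b} - \lambda \frac{\mathbf{b}}{||\mathbf{b}||_2} = \mathbf{b} \frac{||\mathbf{b}||_2 - \lambda}{||\mathbf{b}||_2} \tag{13.109}$$

We can combine these into a vectorial soft-threshold function as follows (Wright et al. 2009): 

$$\text{prox}_R(\mathbf{b}) = \mathbf{b} \frac{\max(||\mathbf{b}||_2 - \lambda, 0)}{\max(||\mathbf{b}||_2 - \lambda, 0) + \lambda} \tag{13.110}$$

If $p = \infty$, we use $C = \{\mathbf{z} : ||\mathbf{z}||_1 \leq 1\}$, which is the $\ell_1$ ball. We can project onto this in $O(d_g)$ 
time using an algorithm described in (Duchi et al. 2008). 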
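
The vectorial soft-threshold of Equation 13.110 can be sketched as follows for the case $p = 2$. The sketch assumes $\lambda > 0$ (otherwise the ratio is undefined at $\mathbf{b} = \mathbf{0}$), and it represents the partition as a list of index lists, which is an encoding chosen here for illustration. Function names are hypothetical.

```python
import numpy as np

def group_soft_threshold(b, lam):
    """Vectorial soft-threshold (Equation 13.110): shrink the 2-norm of b by lam."""
    shrunk = max(np.linalg.norm(b) - lam, 0.0)
    return b * shrunk / (shrunk + lam)

def prox_group_lasso(theta, groups, lam):
    """Apply the group-wise proximal operator to each block of theta."""
    out = np.empty_like(theta, dtype=float)
    for idx in groups:
        out[idx] = group_soft_threshold(theta[idx], lam)
    return out

theta = np.array([3.0, 4.0, 0.3, 0.4, 0.0])
groups = [[0, 1], [2, 3, 4]]
print(prox_group_lasso(theta, groups, lam=1.0))
```

The first group has norm 5 and is rescaled to norm 4, while the second group has norm 0.5, below the threshold, and is set to zero as a block.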
Another approach is to modify the EM algorithm. The method is almost the same as for 
vanilla lasso. If we define $\tau_j^2 = \tau_{g(j)}^2$, where $g(j)$ is the group to which dimension $j$ belongs, 
we can use the same full conditionals for $\sigma^2$ and $\mathbf{w}$ as before. The only changes are as follows: 


• We must modify the full conditional for the weight precisions, which are estimated based on 
a shared set of weights: 

$$\frac{1}{\tau_g^2} \mid \gamma, \mathbf{w}, \sigma^2, \mathbf{y}, \mathbf{X} \sim \text{InverseGaussian}\left( \sqrt{\frac{\gamma^2 \sigma^2}{||\mathbf{w}_g||_2^2}}, \gamma^2 \right) \tag{13.111}$$

where $||\mathbf{w}_g||_2^2 = \sum_{j \in g} w_j^2$. For the E step, we can use 

$$\mathbb{E}\left[ \frac{1}{\tau_g^2} \right] = \frac{\gamma \sigma}{||\mathbf{w}_g||_2} \tag{13.112}$$

• We must modify the full conditional for the tuning parameter, which is now only estimated 
based on $G$ values of $\tau_g^2$: 

$$p(\gamma^2 | \boldsymbol{\tau}) = \text{Ga}\left( a_\gamma + G/2, \; b_\gamma + \frac{1}{2} \sum_{g=1}^{G} \tau_g^2 \right) \tag{13.113}$$


Figure 13.16 (a) Example of the fused lasso. The vertical axis represents array CGH (comparative genomic 
hybridization) intensity, and the horizontal axis represents location along a genome. Source: Figure 1 of 
(Hoefling 2010). (b) Noisy image. (c) Fused lasso estimate using 2d lattice prior. Source: Figure 2 of 
(Hoefling 2010). Used with kind permission of Holger Hoefling. 


Fused lasso 


In some problem settings (e.g., functional data analysis), we want neighboring coefficients to be 
similar to each other, in addition to being sparse. An example is given in Figure 13.16(a), where 
we want to fit a signal that is mostly “off”, but in addition has the property that neighboring 
locations are typically similar in value. We can model this by using a prior of the form 

$$p(\mathbf{w} | \sigma^2) \propto \exp\left( -\frac{\lambda_1}{\sigma} \sum_{j=1}^{D} |w_j| - \frac{\lambda_2}{\sigma} \sum_{j=1}^{D-1} |w_{j+1} - w_j| \right) \tag{13.114}$$

This is known as the fused lasso penalty. In the context of functional data analysis, we often 
use $\mathbf{X} = \mathbf{I}$, so there is one coefficient for each location in the signal (see Section 4.4.2.3). In this 
case, the overall objective has the form 

$$J(\mathbf{w}, \lambda_1, \lambda_2) = \sum_{i=1}^{N} (y_i - w_i)^2 + \lambda_1 \sum_{i=1}^{N} |w_i| + \lambda_2 \sum_{i=1}^{N-1} |w_{i+1} - w_i| \tag{13.115}$$

This is a sparse version of Equation 4.148. 
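
To make the three terms of the objective concrete, the following sketch evaluates the chain-structured objective of Equation 13.115, together with a version in which the differences are taken over an arbitrary list of edges rather than consecutive positions. It only evaluates the objective and does not fit the model; the function names and the `edges` encoding (a list of index pairs) are assumptions made for illustration.

```python
import numpy as np

def fused_lasso_objective(w, y, lam1, lam2):
    """Chain-structured fused lasso objective (Equation 13.115, with X = I)."""
    fit = np.sum((y - w) ** 2)
    sparsity = lam1 * np.sum(np.abs(w))
    smoothness = lam2 * np.sum(np.abs(np.diff(w)))   # |w_{i+1} - w_i| over the chain
    return fit + sparsity + smoothness

def graph_fused_lasso_objective(w, y, edges, lam1, lam2):
    """Same objective, with differences taken over a list of (s, t) index pairs."""
    s, t = np.array(edges).T
    return (np.sum((y - w) ** 2) + lam1 * np.sum(np.abs(w))
            + lam2 * np.sum(np.abs(w[s] - w[t])))

w = np.array([1.0, 1.0, 0.0])
y = np.array([1.0, 2.0, 0.0])
print(fused_lasso_objective(w, y, lam1=0.5, lam2=2.0))
print(graph_fused_lasso_objective(w, y, [(0, 1), (1, 2)], lam1=0.5, lam2=2.0))
```

When the edge list is the chain `[(0, 1), (1, 2)]`, the two functions agree, which is a quick consistency check on the graph version.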
It is possible to generalize this idea beyond chains, and to consider other graph structures, 
using a penalty of the form 

$$J(\mathbf{w}, \lambda_1, \lambda_2) = \sum_{s \in V} (y_s - w_s)^2 + \lambda_1 \sum_{s \in V} |w_s| + \lambda_2 \sum_{(s,t) \in E} |w_s - w_t| \tag{13.116}$$

This is called graph-guided fused lasso (see e.g., (Chen et al. 2010)). The graph might come 
from some prior knowledge, e.g., from a database of known biological pathways. Another 
example is shown in Figure 13.16(b-c), where the graph structure is a 2d lattice. 


GSM interpretation of fused lasso 


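One can show (Kyung et al. 2010) that the fused lasso model is equivalent to the following 
hierarchical model 

$$\mathbf{w} | \sigma^2, \boldsymbol{\tau}, \boldsymbol{\omega} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \boldsymbol{\Sigma}(\boldsymbol{\tau}, \boldsymbol{\omega})) \tag{13.117}$$
$$\tau_j^2 | \gamma_1 \sim \text{Expon}\left( \frac{\gamma_1^2}{2} \right), \quad j = 1:D \tag{13.118}$$
$$\omega_j^2 | \gamma_2 \sim \text{Expon}\left( \frac{\gamma_2^2}{2} \right), \quad j = 1:D-1 \tag{13.119}$$

where $\boldsymbol{\Sigma} = \boldsymbol{\Omega}^{-1}$, and $\boldsymbol{\Omega}$ is a tridiagonal precision matrix with 

$$\text{main diagonal} = \left\{ \frac{1}{\tau_j^2} + \frac{1}{\omega_{j-1}^2} + \frac{1}{\omega_j^2} \right\} \tag{13.120}$$
$$\text{off diagonal} = \left\{ -\frac{1}{\omega_j^2} \right\} \tag{13.121}$$

where we have defined $\omega_0^{-2} = \omega_D^{-2} = 0$. This is very similar to the model in Section 4.4.2.3, 
where we used a chain-structured Gaussian Markov random field as the prior, with fixed 
variance. Here we just let the variance be random. In the case of graph-guided lasso, the structure 
of the graph is reflected in the zero pattern of the Gaussian precision matrix (see Section 19.4.4). 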


Algorithms for fused lasso 


It is possible to generalize the EM algorithm to fit the fused lasso model, by exploiting the 
Markov structure of the Gaussian prior for efficiency. Direct solvers (which don't use the latent 
variable trick) can also be derived (see e.g., (Hoefling 2010)). However, this model is undeniably 
more expensive to fit than the other variants we have considered. 


Elastic net (ridge and lasso combined) 


Although lasso has proved to be effective as a variable selection technique, it has several 
problems (Zou and Hastie 2005), such as the following: 


• If there is a group of variables that are highly correlated (e.g., genes that are in the same 
pathway), then the lasso tends to select only one of them, chosen rather arbitrarily. (This 
is evident from the LARS algorithm: once one member of the group has been chosen, the 
remaining members of the group will not be very correlated with the new residual and hence 
will not be chosen.) It is usually better to select all the relevant variables in a group. If we 
know the grouping structure, we can use group lasso, but often we don’t know the grouping 
structure. 

• In the $D > N$ case, lasso can select at most $N$ variables before it saturates. 

• If $N > D$, but the variables are correlated, it has been empirically observed that the 
prediction performance of ridge is better than that of lasso. 


Zou and Hastie (Zou and Hastie 2005) proposed an approach called the elastic net, which is 
a hybrid between lasso and ridge regression, and which addresses all of these problems. It is apparently 
called the “elastic net” because it is “like a stretchable fishing net that retains ‘all the big fish’” 
(Zou and Hastie 2005). 


Vanilla version 
The vanilla version of the model defines the following objective function: 

$$J(\mathbf{w}, \lambda_1, \lambda_2) = ||\mathbf{y} - \mathbf{X}\mathbf{w}||^2 + \lambda_2 ||\mathbf{w}||_2^2 + \lambda_1 ||\mathbf{w}||_1 \tag{13.122}$$

Notice that this penalty function is strictly convex (assuming $\lambda_2 > 0$) so there is a unique global 
minimum, even if $\mathbf{X}$ is not full rank. 

It can be shown (Zou and Hastie 2005) that any strictly convex penalty on $\mathbf{w}$ will exhibit 
a grouping effect, which means that the regression coefficients of highly correlated variables 
tend to be equal (up to a change of sign if they are negatively correlated). For example, if two 
features are equal, so $\mathbf{X}_{:j} = \mathbf{X}_{:k}$, one can show that their estimates are also equal, $\hat{w}_j = \hat{w}_k$. 
By contrast, with lasso, we may have that $\hat{w}_j = 0$ and $\hat{w}_k \neq 0$ or vice versa. 


Algorithms for vanilla elastic net 


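It is simple to show (Exercise 13.5) that the elastic net problem can be reduced to a lasso problem 
on modified data. In particular, define 

$$\tilde{\mathbf{X}} = c \begin{pmatrix} \mathbf{X} \\ \sqrt{\lambda_2} \mathbf{I}_D \end{pmatrix}, \quad \tilde{\mathbf{y}} = \begin{pmatrix} \mathbf{y} \\ \mathbf{0}_{D \times 1} \end{pmatrix} \tag{13.123}$$

where $c = (1 + \lambda_2)^{-\frac{1}{2}}$. Then we solve 

$$\tilde{\mathbf{w}} = \arg\min_{\tilde{\mathbf{w}}} ||\tilde{\mathbf{y}} - \tilde{\mathbf{X}} \tilde{\mathbf{w}}||^2 + c \lambda_1 ||\tilde{\mathbf{w}}||_1 \tag{13.124}$$

and set $\mathbf{w} = c \tilde{\mathbf{w}}$. 

We can use LARS to solve this subproblem; this is known as the LARS-EN algorithm. If we 
stop the algorithm after $m$ variables have been included, the cost is $O(m^3 + Dm^2)$. Note that 
we can use $m = D$ if we wish, since $\tilde{\mathbf{X}}$ has rank $D$. This is in contrast to lasso, which cannot 
select more than $N$ variables (before jumping to the OLS solution) if $N < D$. 

When using LARS-EN (or other $\ell_1$ solvers), one typically uses cross-validation to select $\lambda_1$ and 
$\lambda_2$. 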


Improved version 


Unfortunately it turns out that the “vanilla” elastic net does not produce functions that predict 
very accurately, unless it is very close to either pure ridge or pure lasso. Intuitively the reason 
is that it performs shrinkage twice: once due to the $\ell_2$ penalty and again due to the $\ell_1$ penalty. 
The solution is simple: undo the $\ell_2$ shrinkage by scaling up the estimates from the vanilla 
version. In other words, if $\mathbf{w}^*$ is the solution of Equation 13.124, then a better estimate is 

$$
\hat{\mathbf{w}} = \sqrt{1 + \lambda_2}\, \mathbf{w}^* \tag{13.125}
$$

We will call this a corrected estimate. 
One can show that the corrected estimates are given by 

$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \mathbf{w}^T \left( \frac{\mathbf{X}^T\mathbf{X} + \lambda_2 \mathbf{I}}{1 + \lambda_2} \right) \mathbf{w} - 2 \mathbf{y}^T \mathbf{X} \mathbf{w} + \lambda_1 \|\mathbf{w}\|_1 \tag{13.126}
$$

Now 

$$
\frac{\mathbf{X}^T\mathbf{X} + \lambda_2 \mathbf{I}}{1 + \lambda_2} = (1 - \rho) \hat{\boldsymbol{\Sigma}} + \rho \mathbf{I} \tag{13.127}
$$

where $\rho = \lambda_2 / (1 + \lambda_2)$ and $\hat{\boldsymbol{\Sigma}} = \mathbf{X}^T\mathbf{X}$. So the elastic net is like lasso but where we use a version of 
$\hat{\boldsymbol{\Sigma}}$ that is shrunk towards $\mathbf{I}$. (See Section 4.2.6 for more discussion of regularized estimates of 
covariance matrices.) 


GSM interpretation of elastic net 


The implicit prior being used by the elastic net has the form 

$$
p(\mathbf{w}|\sigma^2) \propto \exp\left( -\frac{\gamma_1}{\sigma} \sum_{j=1}^D |w_j| - \frac{\gamma_2}{2\sigma^2} \sum_{j=1}^D w_j^2 \right) \tag{13.128}
$$

which is just a product of Gaussian and Laplace distributions. 
This can be written as a hierarchical prior as follows (Kyung et al. 2010; Chen et al. 2011): 

$$
w_j | \sigma^2, \tau_j^2 \sim \mathcal{N}\left(0, \sigma^2 (\tau_j^{-2} + \gamma_2)^{-1}\right) \tag{13.129}
$$

$$
\tau_j^2 | \gamma_1 \sim \text{Expon}\left(\frac{\gamma_1^2}{2}\right) \tag{13.130}
$$

Clearly if $\gamma_2 = 0$, this reduces to the regular lasso. 
It is possible to perform MAP estimation in this model using EM, or Bayesian inference using 
MCMC (Kyung et al. 2010) or variational Bayes (Chen et al. 2011). 


Non-convex regularizers 


Although the Laplace prior results in a convex optimization problem, from a statistical point 
of view this prior is not ideal. There are two main problems with it. First, it does not put 
enough probability mass near 0, so it does not sufficiently suppress noise. Second, it does 
not put enough probability mass on large values, so it causes shrinkage of relevant coefficients, 
corresponding to “signal”. (This can be seen in Figure 13.5(a): we see that $\ell_1$ estimates of large 
coefficients are significantly smaller than their ML estimates, a phenomenon known as bias.) 

Both problems can be solved by going to more flexible kinds of priors which have a larger 
spike at 0 and heavier tails. Even though we cannot find the global optimum anymore, these 
non-convex methods often outperform $\ell_1$ regularization, both in terms of predictive accuracy 
and in detecting relevant variables (Fan and Li 2001; Schniter et al. 2008). We give some examples 
below. 


Bridge regression 


A natural generalization of $\ell_1$ regularization, known as bridge regression (Frank and Friedman 
1993), has the form 

$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \text{NLL}(\mathbf{w}) + \lambda \sum_j |w_j|^b \tag{13.131}
$$

for $b \geq 0$. This corresponds to MAP estimation using an exponential power distribution given 
by 

$$
\text{ExpPower}(w | \mu, a, b) \triangleq \frac{b}{2 a \Gamma(1/b)} \exp\left( -\left( \frac{|w - \mu|}{a} \right)^b \right) \tag{13.132}
$$

If $b = 2$, we get the Gaussian distribution (with $a = \sigma\sqrt{2}$), corresponding to ridge regression; if 
we set $b = 1$, we get the Laplace distribution, corresponding to lasso; if we set $b = 0$, we get 
$\ell_0$ regression, which is equivalent to best subset selection. Unfortunately, the objective is not 
convex for $b < 1$, and is not sparsity promoting for $b > 1$. So the $\ell_1$ norm is the tightest convex 
approximation to the $\ell_0$ norm. 
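
The special cases can be verified numerically. The sketch below assumes the density is normalized as $b / (2a\Gamma(1/b)) \exp(-(|w - \mu|/a)^b)$, which is the standard form of the exponential power family; the function names are illustrative. It checks that $b = 2$ with $a = \sigma\sqrt{2}$ matches the Gaussian density and that $b = 1$ matches the Laplace density with scale $a$.

```python
import math
import numpy as np

def exp_power_pdf(w, mu, a, b):
    """Exponential power density b / (2 a Gamma(1/b)) * exp(-(|w - mu| / a)^b)."""
    z = np.abs(np.asarray(w, dtype=float) - mu) / a
    return b / (2.0 * a * math.gamma(1.0 / b)) * np.exp(-z ** b)

def gauss_pdf(w, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    w = np.asarray(w, dtype=float)
    return np.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def laplace_pdf(w, mu, scale):
    """Laplace density with location mu and scale parameter `scale`."""
    w = np.asarray(w, dtype=float)
    return np.exp(-np.abs(w - mu) / scale) / (2.0 * scale)

w_grid = np.linspace(-3.0, 3.0, 7)
sigma = 1.7
print(np.allclose(exp_power_pdf(w_grid, 0.0, sigma * math.sqrt(2.0), 2.0),
                  gauss_pdf(w_grid, 0.0, sigma)))    # b = 2 recovers the Gaussian
print(np.allclose(exp_power_pdf(w_grid, 0.5, 0.8, 1.0),
                  laplace_pdf(w_grid, 0.5, 0.8)))    # b = 1 recovers the Laplace
```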

The effect of changing $b$ is illustrated in Figure 13.17, where we plot the prior for $b = 2$, $b = 1$ 
and $b = 0.4$; we assume $p(\mathbf{w}) = p(w_1) p(w_2)$. We also plot the posterior after seeing a single 
observation, $(\mathbf{x}, y)$, which imposes a single linear constraint of the form $y = \mathbf{w}^T \mathbf{x}$, with a 
certain tolerance controlled by the observation noise (compare to Figure 7.11). We see that 
the mode of the Laplace is on the vertical axis, corresponding to $w_1 = 0$. By contrast, there are 
two modes when using $b = 0.4$, corresponding to two different sparse solutions. When using 
the Gaussian, the MAP estimate is not sparse (the mode does not lie on either of the coordinate 
axes). 


Hierarchical adaptive lasso 


Recall that one of the principal problems with lasso is that it results in biased estimates. 
This is because it needs to use a large value of $\lambda$ to “squash” the irrelevant parameters, but 
this then over-penalizes the relevant parameters. It would be better if we could associate a 
different penalty parameter with each parameter. Of course, it is completely infeasible to tune 
$D$ parameters by cross validation, but this poses no problem to the Bayesian: we simply make 
each $\tau_j^2$ have its own private tuning parameter, $\gamma_j$, which are now treated as random variables 
coming from the conjugate prior $\gamma_j \sim \text{IG}(a, b)$. The full model is as follows: 

$$
\gamma_j \sim \text{IG}(a, b) \tag{13.133}
$$

$$
\tau_j^2 | \gamma_j \sim \text{Ga}(1, \gamma_j^2 / 2) \tag{13.134}
$$

$$
w_j | \tau_j^2 \sim \mathcal{N}(0, \tau_j^2) \tag{13.135}
$$


See Figure 13.18(a). This has been called the hierarchical adaptive lasso (HAL) (Lee et al. 2010) 
(see also (Lee et al. 2011; Cevher 2009; Armagan et al. 2011)). We can integrate out $\tau_j^2$, which 
induces a $\text{Lap}(w_j | 0, 1/\gamma_j)$ distribution on $w_j$ as before. The result is that $p(w_j)$ is now a 
scaled mixture of Laplacians. It turns out that we can fit this model (i.e., compute a local 
posterior mode) using EM, as we explain below. The resulting estimate, $\hat{\mathbf{w}}_{HAL}$, often works 
much better than the estimate returned by lasso, $\hat{\mathbf{w}}_{L1}$, in the sense that it is more likely to 
contain zeros in the right places (model selection consistency) and more likely to result in good 
predictions (prediction consistency) (Lee et al. 2010). We give an explanation for this behavior in 
Section 13.6.2.2. 

Figure 13.17 Top: plot of log prior for three different distributions with unit variance: Gaussian, Laplace 
and exponential power. Bottom: plot of log posterior after observing a single observation, corresponding 
to a single linear constraint. The precision of this observation is shown by the diagonal lines in the top 
figure. In the case of the Gaussian prior, the posterior is unimodal and symmetric. In the case of the 
Laplace prior, the posterior is unimodal and asymmetric (skewed). In the case of the exponential power 
prior, the posterior is bimodal. Based on Figure 1 of (Seeger 2008). Figure generated by sparsePostPlot, 
written by Florian Steinke. 


EM for HAL 
Since the inverse Gamma is conjugate to the Laplace, we find that the E step for $\gamma_j$ is given by 

$$
p(\gamma_j | w_j) = \text{IG}(a + 1, b + |w_j|) \tag{13.136}
$$


The E step for $\sigma^2$ is the same as for vanilla lasso. 
The prior for $\mathbf{w}$ has the following form: 

$$
p(\mathbf{w} | \boldsymbol{\gamma}) = \prod_j \frac{1}{2\gamma_j} \exp(-|w_j| / \gamma_j) \tag{13.137}
$$

Hence the M step must optimize 

$$
\hat{\mathbf{w}}^{(t+1)} = \arg\max_{\mathbf{w}} \log \mathcal{N}(\mathbf{y} | \mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}) - \sum_j |w_j| \, \mathbb{E}[1/\gamma_j] \tag{13.138}
$$

Figure 13.18 (a) DGM for hierarchical adaptive lasso. (b) Contours of hierarchical adaptive Laplace. Based 
on Figure 1 of (Lee et al. 2010). Figure generated by normalGammaPenaltyPlotDemo. 


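The expectation is given by 

$$
\mathbb{E}[1/\gamma_j] = \frac{a + 1}{b + |w_j^{(t)}|} \triangleq s_j^{(t)} \tag{13.139}
$$

Thus the M step becomes a weighted lasso problem: 

$$
\hat{\mathbf{w}}^{(t+1)} = \arg\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \sum_j s_j^{(t)} |w_j| \tag{13.140}
$$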


This is easily solved using standard methods (e.g., LARS). Note that if the coefficient was 
estimated to be large in the previous iteration (so $w_j^{(t)}$ is large), then the scaling factor $s_j^{(t)}$ will 
be small, so large coefficients are not penalized heavily. Conversely, small coefficients do get 
penalized heavily. This is the way that the algorithm adapts the penalization strength of each 
coefficient. The result is an estimate that is often much sparser than returned by lasso, but also 
less biased. 

Note that if we set $a = b = 0$, and we only perform 1 iteration of EM, we get a method that 
is closely related to the adaptive lasso of (Zou 2006; Zou and Li 2008). This EM algorithm is 
also closely related to some iteratively reweighted $\ell_1$ methods proposed in the signal processing 
community (Chartrand and Yin 2008; Candes et al. 2008). 
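
The EM loop can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the weights $s_j = (a+1)/(b + |w_j|)$ and the weighted lasso objective $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \sum_j s_j |w_j|$, uses coordinate descent instead of LARS for the M step, and runs on synthetic data; all function names are hypothetical. Because $(a+1)\log(b + |w_j|)$ is concave in $|w_j|$, each iteration cannot increase $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + (a+1)\sum_j \log(b + |w_j|)$, which the code tracks.

```python
import numpy as np

def soft(c, t):
    """Soft thresholding of a scalar: sign(c) * max(|c| - t, 0)."""
    return np.sign(c) * max(abs(c) - t, 0.0)

def weighted_lasso_cd(X, y, s, w0, n_sweeps=100):
    """Coordinate descent for ||y - Xw||^2 + sum_j s_j |w_j|, warm-started at w0."""
    w = w0.copy()
    col_sq = (X ** 2).sum(axis=0)
    r = y - X @ w
    for _ in range(n_sweeps):
        for j in range(len(w)):
            r += X[:, j] * w[j]
            w[j] = soft(X[:, j] @ r, s[j] / 2.0) / col_sq[j]
            r -= X[:, j] * w[j]
    return w

def hal_objective(X, y, w, a, b):
    """||y - Xw||^2 + (a+1) * sum_j log(b + |w_j|): the quantity each EM step decreases."""
    return np.sum((y - X @ w) ** 2) + (a + 1.0) * np.sum(np.log(b + np.abs(w)))

def hal_em(X, y, a=1.0, b=0.1, n_iter=10):
    w = np.zeros(X.shape[1])
    history = [hal_objective(X, y, w, a, b)]
    for _ in range(n_iter):
        s = (a + 1.0) / (b + np.abs(w))      # E step: s_j = E[1 / gamma_j]
        w = weighted_lasso_cd(X, y, s, w)    # M step: weighted lasso
        history.append(hal_objective(X, y, w, a, b))
    return w, history

rng = np.random.default_rng(0)               # illustrative synthetic data
X = rng.standard_normal((100, 8))
w_true = np.array([2.0, -3.0, 0, 0, 0, 0, 0, 0])
y = X @ w_true + 0.1 * rng.standard_normal(100)

w_hal, history = hal_em(X, y)
print(np.round(w_hal, 3))
```

The first iteration starts from $\mathbf{w} = \mathbf{0}$, so all weights are equal and the M step is an ordinary lasso; later iterations relax the penalty on the coefficients that survived.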


Understanding the behavior of HAL 


We can get a better understanding of HAL by integrating out $\gamma_j$ to get the following marginal 
distribution, 

$$
p(w_j | a, b) = \frac{a}{2b} \left( \frac{|w_j|}{b} + 1 \right)^{-(a+1)} \tag{13.141}
$$

Figure 13.19 Thresholding behavior of two penalty functions (negative log priors). (a) Laplace. 
(b) Hierarchical adaptive Laplace. Based on Figure 2 of (Lee et al. 2010). Figure generated by 
normalGammaThresholdPlotDemo. 


This is an instance of the generalized t distribution (McDonald and Newey 1988) (in (Cevher 
2009; Armagan et al. 2011), this is called the double Pareto distribution) defined as 

$$
\text{GT}(w | \mu, a, c, q) \triangleq \frac{q}{2 c a^{1/q} B(1/q, a)} \left( 1 + \frac{|w - \mu|^q}{a c^q} \right)^{-(a + 1/q)} \tag{13.142}
$$

where $c$ is the scale parameter (which controls the degree of sparsity), and $a$ is related to the 
degrees of freedom. When $q = 2$ and $c = \sqrt{2}$ we recover the standard t distribution; when 
$a \to \infty$, we recover the exponential power distribution; and when $q = 1$ and $a = \infty$ we 
get the Laplace distribution. In the context of the current model, we see that $p(w_j | a, b) = 
\text{GT}(w_j | 0, a, b/a, 1)$. 

The resulting penalty term has the form 

$$
\pi_\lambda(w_j) \triangleq -\log p(w_j) = (a + 1) \log\left( 1 + \frac{|w_j|}{b} \right) + \text{const} \tag{13.143}
$$

where $\lambda = (a, b)$ are the tuning parameters. We plot this penalty in 2d (i.e., we plot $\pi_\lambda(w_1) + 
\pi_\lambda(w_2)$) in Figure 13.18(b) for various values of $b$. Compared to the diamond-shaped Laplace 
penalty, shown in Figure 13.3(a), we see that the HAL penalty looks more like a “star fish”: it 
puts much more density along the “spines”, thus enforcing sparsity more aggressively. Note that 
this penalty is clearly not convex. 

Table 13.2 Some scale mixtures of Gaussians. Abbreviations: $C^+$ = half-rectified Cauchy; Ga = Gamma 
(shape and rate parameterization); GT = generalized t; IG = inverse Gamma; NEG = Normal-Exponential- 
Gamma; NG = Normal-Gamma; NJ = Normal-Jeffreys. The horseshoe distribution is the name we give 
to the distribution induced on $w_j$ by the prior described in (Carvahlo et al. 2010); this has no simple 
analytic form. The definitions of the NEG and NG densities are a bit complicated, but can be found in the 
references. The other distributions are defined in the text. 

$p(\tau_j^2)$                                   $p(\gamma_j)$        $p(w_j)$                          Ref 
$\text{Ga}(1, \gamma^2/2)$                      Fixed                $\text{Lap}(0, 1/\gamma)$         (Andrews and Mallows 1974; West 1987) 
$\text{Ga}(1, \gamma^2/2)$                      $\text{IG}(a, b)$    $\text{GT}(0, a, b/a, 1)$         (Lee et al. 2010, 2011; Cevher 2009; Armagan et al. 2011) 
$\text{Ga}(1, \gamma^2/2)$                      $\text{Ga}(a, b)$    $\text{NEG}(a, b)$                (Griffin and Brown 2007, 2010; Chen et al. 2011) 
$\text{Ga}(\delta, \gamma^2/2)$                 Fixed                $\text{NG}(\delta, \gamma)$       (Griffin and Brown 2007, 2010) 
$\text{Ga}(\tau_j^2 | 0, 0)$                    -                    $\text{NJ}(w_j)$                  (Figueiredo 2003) 
$\text{IG}(\delta/2, \delta\gamma^2/2)$         Fixed                $\mathcal{T}(0, \delta, \gamma)$  (Andrews and Mallows 1974; West 1987) 
$C^+(0, \gamma)$                                $C^+(0, b)$          horseshoe$(b)$                    (Carvahlo et al. 2010) 


We can gain further understanding into the behavior of this penalty function by considering 
applying it to the problem of linear regression with an orthogonal design matrix. In this case, 
one can show that the objective becomes 

$$
\begin{aligned}
J(\mathbf{w}) &= \frac{1}{2} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \sum_{j=1}^D \pi_\lambda(w_j) \qquad (13.144) \\
&= \frac{1}{2} \|\mathbf{y} - \hat{\mathbf{y}}\|^2 + \frac{1}{2} \sum_{j=1}^D (\hat{w}_j^{mle} - w_j)^2 + \sum_{j=1}^D \pi_\lambda(w_j) \qquad (13.145)
\end{aligned}
$$

where $\hat{\mathbf{w}}^{mle} = \mathbf{X}^T \mathbf{y}$ is the MLE and $\hat{\mathbf{y}} = \mathbf{X} \hat{\mathbf{w}}^{mle}$. Thus we can compute the MAP estimate 
one dimension at a time by solving the following 1d optimization problem: 

$$
\hat{w}_j = \arg\min_{w_j} \frac{1}{2} (\hat{w}_j^{mle} - w_j)^2 + \pi_\lambda(w_j) \tag{13.146}
$$

In Figure 13.19(a) we plot the lasso estimate, $\hat{w}^{L1}$, vs the ML estimate, $\hat{w}^{mle}$. We see that the 
$\ell_1$ estimator has the usual soft-thresholding behavior seen earlier in Figure 13.5(a). However, 
this behavior is undesirable since the large magnitude coefficients are also shrunk towards 0, 
whereas we would like them to be equal to their unshrunken ML estimates. 

In Figure 13.19(b) we plot the HAL estimate, $\hat{w}^{HAL}$, vs the ML estimate $\hat{w}^{mle}$. We see that 
this approximates the more desirable hard thresholding behavior seen earlier in Figure 13.5(b) 
much more closely. 
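
The two thresholding rules can be compared directly by solving the 1d problem numerically. The sketch below is illustrative only: it assumes the 1d objective $\frac{1}{2}(z - w)^2 + \pi(w)$ with $z$ the ML estimate, uses the closed-form soft threshold for the Laplace penalty $\lambda|w|$, and a simple grid search for the HAL penalty $(a+1)\log(1 + |w|/b)$; the function names and the grid size are assumptions.

```python
import numpy as np

def lasso_threshold(z, lam):
    """argmin_w 0.5*(z - w)^2 + lam*|w|, i.e. soft thresholding."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def hal_threshold(z, a, b, n_grid=200001):
    """argmin_w 0.5*(z - w)^2 + (a+1)*log(1 + |w|/b), found by grid search.

    The minimizer has the same sign as z and magnitude at most |z|,
    so it suffices to search over magnitudes in [0, |z|].
    """
    mags = np.linspace(0.0, abs(z), n_grid)
    cost = 0.5 * (abs(z) - mags) ** 2 + (a + 1.0) * np.log1p(mags / b)
    return np.sign(z) * mags[np.argmin(cost)]

# With a = b = 1 the HAL penalty has slope (a+1)/b = 2 at the origin,
# so it is compared against a lasso penalty with lam = 2.
for z in [1.0, 10.0]:
    print(z, lasso_threshold(z, 2.0), hal_threshold(z, 1.0, 1.0))
```

For the small input both rules return zero. For the large input, soft thresholding subtracts the full $\lambda = 2$, while the HAL estimate stays within about 0.2 of the ML value, which is the near hard-thresholding behavior described above.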


Other hierarchical priors 


Many other hierarchical sparsity-promoting priors have been proposed; see Table 13.2 for a brief 
summary. In some cases, we can analytically derive the form of the marginal prior for $w_j$. 
Generally speaking, this prior is not log-concave, so the corresponding penalty is not convex. 

A particularly interesting prior is the improper Normal-Jeffreys prior, which has been used 
in (Figueiredo 2003). This puts a non-informative Jeffreys prior on the variance, $\text{Ga}(\tau_j^2 | 0, 0) \propto 
1/\tau_j^2$; the resulting marginal has the form $p(w_j) = \text{NJ}(w_j) \propto 1/|w_j|$. This gives rise to a 
thresholding rule that looks very similar to HAL in Figure 13.19(b), which in turn is very similar 
to hard thresholding. However, this prior has no free parameters, which is both a good thing 
(nothing to tune) and a bad thing (no ability to adapt the level of sparsity). 


Automatic relevance determination (ARD)/sparse Bayesian learning (SBL) 


All the methods we have considered so far (except for the spike-and-slab methods in 
Section 13.2.1) have used a factorial prior of the form $p(\mathbf{w}) = \prod_j p(w_j)$. We have seen how these 
priors can be represented in terms of Gaussian scale mixtures of the form $w_j \sim \mathcal{N}(0, \tau_j^2)$, where 
$\tau_j^2$ has one of the priors listed in Table 13.2. Using these latent variances, we can represent the 
model in the form $\tau_j^2 \to w_j \to \mathbf{y} \leftarrow \mathbf{X}$. We can then use EM to perform MAP estimation, 
where in the E step we infer $p(\tau_j^2 | w_j)$, and in the M step we estimate $\mathbf{w}$ from $\mathbf{y}$, $\mathbf{X}$ and $\boldsymbol{\tau}$. 
This M step either involves a closed-form weighted $\ell_2$ optimization (in the case of Gaussian 
scale mixtures), or a weighted $\ell_1$ optimization (in the case of Laplacian scale mixtures). We also 
discussed how to perform Bayesian inference in such models, rather than just computing MAP 
estimates. 

In this section, we discuss an alternative approach based on type II ML estimation (empirical 
Bayes), whereby we integrate out $\mathbf{w}$ and maximize the marginal likelihood wrt $\boldsymbol{\tau}$. This EB 
procedure can be implemented via EM, or via a reweighted $\ell_1$ scheme, as we will explain below. 
Having estimated the variances, we plug them in to compute the posterior mean of the weights, 
$\mathbb{E}[\mathbf{w} | \hat{\boldsymbol{\tau}}, \mathcal{D}]$; rather surprisingly (in view of the Gaussian prior), the result is an (approximately) 
sparse estimate, for reasons we explain below. 

In the context of neural networks, this method is called automatic relevance 
determination or ARD (MacKay 1995b; Neal 1996): see Section 16.5.7.5. In the context of the 
linear models we are considering in this chapter, this method is called sparse Bayesian learning 
or SBL (Tipping 2001). Combining ARD/SBL with basis function expansion in a linear model 
gives rise to a technique called the relevance vector machine (RVM), which we will discuss in 
Section 14.3.2. 


ARD for linear regression 


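We will explain the procedure in the context of linear regression; ARD for GLMs requires the use 
of the Laplace (or some other) approximation. It is conventional, when discussing 
ARD / SBL, to denote the weight precisions by $\alpha_j = 1/\tau_j^2$, and the measurement precision 
by $\beta = 1/\sigma^2$ (do not confuse this with the use of $\beta$ in statistics to represent the regression 
coefficients!). In particular, we will assume the following model: 

$$
p(y | \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(y | \mathbf{w}^T \mathbf{x}, 1/\beta) \tag{13.147}
$$

$$
p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mathbf{0}, \mathbf{A}^{-1}) \tag{13.148}
$$

where $\mathbf{A} = \text{diag}(\boldsymbol{\alpha})$. The marginal likelihood can be computed analytically as follows: 

$$
\begin{aligned}
p(\mathbf{y} | \mathbf{X}, \boldsymbol{\alpha}, \beta) &= \int \mathcal{N}(\mathbf{y} | \mathbf{X}\mathbf{w}, \beta^{-1} \mathbf{I}_N) \, \mathcal{N}(\mathbf{w} | \mathbf{0}, \mathbf{A}^{-1}) \, d\mathbf{w} \qquad (13.149) \\
&= \mathcal{N}(\mathbf{y} | \mathbf{0}, \beta^{-1} \mathbf{I}_N + \mathbf{X} \mathbf{A}^{-1} \mathbf{X}^T) \qquad (13.150) \\
&= (2\pi)^{-N/2} |\mathbf{C}_{\boldsymbol{\alpha}}|^{-1/2} \exp\left( -\frac{1}{2} \mathbf{y}^T \mathbf{C}_{\boldsymbol{\alpha}}^{-1} \mathbf{y} \right) \qquad (13.151)
\end{aligned}
$$

where 

$$
\mathbf{C}_{\boldsymbol{\alpha}} \triangleq \beta^{-1} \mathbf{I}_N + \mathbf{X} \mathbf{A}^{-1} \mathbf{X}^T \tag{13.152}
$$

Compare this to the marginal likelihood in Equation 13.13 in the spike and slab model; modulo 
the $\beta = 1/\sigma^2$ factor missing from the second term, the equations are the same, except we have 
replaced the binary $\gamma_j \in \{0, 1\}$ with continuous $\alpha_j \in \mathbb{R}^+$. In log form, the objective becomes 

$$
\ell(\boldsymbol{\alpha}, \beta) \triangleq -2 \log p(\mathbf{y} | \mathbf{X}, \boldsymbol{\alpha}, \beta) = \log |\mathbf{C}_{\boldsymbol{\alpha}}| + \mathbf{y}^T \mathbf{C}_{\boldsymbol{\alpha}}^{-1} \mathbf{y} + \text{const} \tag{13.153}
$$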


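To regularize the problem, we may put a conjugate prior on each precision, $\alpha_j \sim \text{Ga}(a, b)$ 
and $\beta \sim \text{Ga}(c, d)$. The modified objective becomes 

$$
\begin{aligned}
\ell'(\boldsymbol{\alpha}, \beta) &\triangleq -2 \left[ \log p(\mathbf{y} | \mathbf{X}, \boldsymbol{\alpha}, \beta) + \sum_j \log \text{Ga}(\alpha_j | a, b) + \log \text{Ga}(\beta | c, d) \right] \qquad (13.154) \\
&= \log |\mathbf{C}_{\boldsymbol{\alpha}}| + \mathbf{y}^T \mathbf{C}_{\boldsymbol{\alpha}}^{-1} \mathbf{y} - 2 \sum_j (a \log \alpha_j - b \alpha_j) - 2 (c \log \beta - d \beta) + \text{const} \qquad (13.155)
\end{aligned}
$$

This is useful when performing Bayesian inference for $\boldsymbol{\alpha}$ and $\beta$ (Bishop and Tipping 2000). 
However, when performing (type II) point estimation, we will use the improper prior $a = b = 
c = d = 0$, which results in maximal sparsity. 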

Below we describe how to optimize $\ell(\boldsymbol{\alpha}, \beta)$ wrt the precision terms $\boldsymbol{\alpha}$ and $\beta$.⁷ This is a 
proxy for finding the most probable model setting of $\boldsymbol{\gamma}$ in the spike and slab model, which in 
turn is closely related to $\ell_0$ regularization. In particular, it can be shown (Wipf et al. 2010) that 
the objective in Equation 13.153 has many fewer local optima than the $\ell_0$ objective, and hence 
is much easier to optimize. 

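Once we have estimated $\boldsymbol{\alpha}$ and $\beta$, we can compute the posterior over the parameters using 

$$
p(\mathbf{w} | \mathcal{D}, \hat{\boldsymbol{\alpha}}, \hat{\beta}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \tag{13.156}
$$

$$
\boldsymbol{\Sigma}^{-1} = \hat{\beta} \mathbf{X}^T \mathbf{X} + \mathbf{A} \tag{13.157}
$$

$$
\boldsymbol{\mu} = \hat{\beta} \boldsymbol{\Sigma} \mathbf{X}^T \mathbf{y} \tag{13.158}
$$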


The fact that we compute a posterior over w, while simultaneously encouraging sparsity, is why 
the method is called “sparse Bayesian learning”. Nevertheless, since there are many ways to be 
sparse and Bayesian, we will use the “ARD” term instead, even in the linear model context. (In 
addition, SBL is only “being Bayesian” about the values of the coefficients, rather than reflecting 
uncertainty about the set of relevant variables, which is typically of more interest.) 


7. An alternative approach to optimizing $\beta$ is to put a Gamma prior on $\beta$ and to integrate it out to get a Student 
posterior for $\mathbf{w}$ (Buntine and Weigend 1991). However, it turns out that this results in a less accurate estimate for 
$\boldsymbol{\alpha}$ (MacKay 1999). In addition, working with Gaussians is easier than working with the Student distribution, and the 
Gaussian case generalizes more easily to other cases such as logistic regression. 

Figure 13.20 Illustration of why ARD results in sparsity. The vector of inputs $\mathbf{x}$ does not point towards 
the vector of outputs $\mathbf{y}$, so the feature should be removed. (a) For finite $\alpha$, the probability density is spread 
in directions away from $\mathbf{y}$. (b) When $\alpha = \infty$, the probability density at $\mathbf{y}$ is maximized. Based on Figure 
8 of (Tipping 2001). 


Whence sparsity? 


If $\hat{\alpha}_j \approx 0$, we find $\hat{w}_j \approx \hat{w}_j^{mle}$, since the Gaussian prior shrinking $w_j$ towards 0 has zero 
precision. However, if we find that $\hat{\alpha}_j \approx \infty$, then the prior is very confident that $w_j = 0$, and 
hence that feature $j$ is “irrelevant”. Hence the posterior mean will have $\hat{w}_j \approx 0$. Thus irrelevant 
features automatically have their weights “turned off” or “pruned out”. 

We now give an intuitive argument, based on (Tipping 2001), about why ML-II should 
encourage $\alpha_j \to \infty$ for irrelevant features. Consider a 1d linear regression with 2 training examples, 
so $\mathbf{X} = \mathbf{x} = (x_1, x_2)$, and $\mathbf{y} = (y_1, y_2)$. We can plot $\mathbf{x}$ and $\mathbf{y}$ as vectors in the plane, as 
shown in Figure 13.20. Suppose the feature is irrelevant for predicting the response, so $\mathbf{x}$ points 
in a nearly orthogonal direction to $\mathbf{y}$. Let us see what happens to the marginal likelihood as we 
change $\alpha$. The marginal likelihood is given by $p(\mathbf{y} | \mathbf{x}, \alpha, \beta) = \mathcal{N}(\mathbf{y} | \mathbf{0}, \mathbf{C})$, where 

$$
\mathbf{C} = \frac{1}{\beta} \mathbf{I} + \frac{1}{\alpha} \mathbf{x} \mathbf{x}^T \tag{13.159}
$$

If $\alpha$ is finite, this density will be elongated along the direction of $\mathbf{x}$, as in Figure 13.20(a). 
However, if $\alpha = \infty$, we find $\mathbf{C} = \frac{1}{\beta} \mathbf{I}$, so $\mathbf{C}$ is spherical, as in Figure 13.20(b). If $|\mathbf{C}|$ is held 
constant, the latter assigns higher probability density to the observed response vector $\mathbf{y}$, so this 
is the preferred solution. In other words, the marginal likelihood “punishes” solutions where $\alpha_j$ 
is small but $\mathbf{X}_{:,j}$ is irrelevant, since these waste probability mass. It is more parsimonious (from 
the point of view of Bayesian Occam’s razor) to eliminate redundant dimensions. 
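
The two-example argument is easy to reproduce numerically. The sketch below evaluates $\log \mathcal{N}(\mathbf{y} | \mathbf{0}, \mathbf{C})$ with $\mathbf{C} = \frac{1}{\beta}\mathbf{I} + \frac{1}{\alpha}\mathbf{x}\mathbf{x}^T$ over a range of $\alpha$ values, once for a feature vector orthogonal to $\mathbf{y}$ and once for one parallel to it. The particular vectors, the value of $\beta$, and the function name are illustrative assumptions.

```python
import numpy as np

def log_marginal(y, x, alpha, beta):
    """log N(y | 0, C) with C = (1/beta) I + (1/alpha) x x^T."""
    n = len(y)
    C = np.eye(n) / beta + np.outer(x, x) / alpha
    _, logdet = np.linalg.slogdet(C)
    quad = y @ np.linalg.solve(C, y)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)

beta = 100.0
y = np.array([2.0, 0.0])
x_irrelevant = np.array([0.0, 1.0])     # orthogonal to y
x_relevant = np.array([1.0, 0.0])       # parallel to y
alphas = np.array([0.25, 1.0, 1e2, 1e4, 1e6])

ll_irrelevant = np.array([log_marginal(y, x_irrelevant, a, beta) for a in alphas])
ll_relevant = np.array([log_marginal(y, x_relevant, a, beta) for a in alphas])
print(ll_irrelevant)   # increases with alpha: pruning the feature is preferred
print(ll_relevant)     # highest at small alpha: the feature is kept
```

For the orthogonal feature the quadratic term does not depend on $\alpha$, so only the $\log|\mathbf{C}|$ term changes, and it shrinks as $\alpha$ grows. For the parallel feature a finite $\alpha$ is needed to give $\mathbf{y}$ appreciable density.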


Connection to MAP estimation 


ARD seems quite different from the MAP estimation methods we have been considering earlier 
in this chapter. In particular, in ARD, we are not integrating out $\boldsymbol{\alpha}$ and optimizing $\mathbf{w}$, but vice 
versa. Because the parameters $w_j$ become correlated in the posterior (due to explaining away), 
when we estimate $\alpha_j$ we are borrowing information from all the features, not just feature $j$. 
Consequently, the effective prior $p(\mathbf{w} | \hat{\boldsymbol{\alpha}})$ is non-factorial, and furthermore it depends on the 
data $\mathcal{D}$ (and $\sigma^2$). However, in (Wipf and Nagarajan 2007), it was shown that ARD can be viewed 
as the following MAP estimation problem: 

$$
\hat{\mathbf{w}}^{ARD} = \arg\min_{\mathbf{w}} \beta \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + g_{ARD}(\mathbf{w}) \tag{13.160}
$$

$$
g_{ARD}(\mathbf{w}) \triangleq \min_{\boldsymbol{\alpha} \geq 0} \sum_j \alpha_j w_j^2 + \log |\mathbf{C}_{\boldsymbol{\alpha}}| \tag{13.161}
$$

The proof, which is based on convex analysis, is a little complicated and hence is omitted. 

Furthermore, (Wipf and Nagarajan 2007; Wipf et al. 2010) prove that MAP estimation with 
non-factorial priors is strictly better than MAP estimation with any possible factorial prior in 
the following sense: the non-factorial objective always has fewer local minima than factorial 
objectives, while still satisfying the property that the global optimum of the non-factorial 
objective corresponds to the global optimum of the $\ell_0$ objective. This is a property that $\ell_1$ 
regularization, which has no local minima, does not enjoy. 


Algorithms for ARD * 


In this section, we review several different algorithms for implementing ARD. 


EM algorithm 


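The easiest way to implement SBL/ARD is to use EM. The expected complete data log likelihood 
is given by 

$$
\begin{aligned}
Q(\boldsymbol{\alpha}, \beta) &= \mathbb{E}\left[ \log \mathcal{N}(\mathbf{y} | \mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}) + \log \mathcal{N}(\mathbf{w} | \mathbf{0}, \mathbf{A}^{-1}) \right] \qquad (13.162) \\
&= \frac{1}{2} \mathbb{E}\left[ N \log \beta - \beta \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \sum_j \log \alpha_j - \text{tr}(\mathbf{A} \mathbf{w} \mathbf{w}^T) \right] + \text{const} \qquad (13.163) \\
&= \frac{1}{2} N \log \beta - \frac{\beta}{2} \left( \|\mathbf{y} - \mathbf{X}\boldsymbol{\mu}\|^2 + \text{tr}(\mathbf{X}^T \mathbf{X} \boldsymbol{\Sigma}) \right) \\
&\quad + \frac{1}{2} \sum_j \log \alpha_j - \frac{1}{2} \text{tr}\left[ \mathbf{A} (\boldsymbol{\mu} \boldsymbol{\mu}^T + \boldsymbol{\Sigma}) \right] + \text{const} \qquad (13.164)
\end{aligned}
$$

where $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are computed in the E step using Equation 13.158. 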
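Suppose we put a $\text{Ga}(a, b)$ prior on $\alpha_j$ and a $\text{Ga}(c, d)$ prior on $\beta$. The penalized objective 
becomes 

$$
Q'(\boldsymbol{\alpha}, \beta) = Q(\boldsymbol{\alpha}, \beta) + \sum_j (a \log \alpha_j - b \alpha_j) + c \log \beta - d \beta \tag{13.165}
$$

Setting $\frac{dQ'}{d\alpha_j} = 0$ we get the following M step: 

$$
\alpha_j = \frac{1 + 2a}{\mathbb{E}[w_j^2] + 2b} = \frac{1 + 2a}{\mu_j^2 + \Sigma_{jj} + 2b} \tag{13.166}
$$

If $\alpha_j = \alpha$, and $a = b = 0$, the update becomes 

$$
\alpha = \frac{D}{\mathbb{E}[\mathbf{w}^T \mathbf{w}]} = \frac{D}{\boldsymbol{\mu}^T \boldsymbol{\mu} + \text{tr}(\boldsymbol{\Sigma})} \tag{13.167}
$$

The update for $\beta$ is given by 

$$
\beta_{new}^{-1} = \frac{\|\mathbf{y} - \mathbf{X}\boldsymbol{\mu}\|^2 + \beta^{-1} \sum_j (1 - \alpha_j \Sigma_{jj}) + 2d}{N + 2c} \tag{13.168}
$$

(Deriving this is Exercise 13.2.) 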


Fixed-point algorithm 


A faster and more direct approach is to directly optimize the objective in Equation 13.155. One 
can show (Exercise 13.3) that the equations $\frac{d\ell}{d\alpha_j} = 0$ and $\frac{d\ell}{d\beta} = 0$ lead to the following fixed 
point updates: 

$$
\alpha_j \leftarrow \frac{\gamma_j + 2a}{\mu_j^2 + 2b} \tag{13.169}
$$

$$
\beta^{-1} \leftarrow \frac{\|\mathbf{y} - \mathbf{X}\boldsymbol{\mu}\|^2 + 2d}{N - \sum_j \gamma_j + 2c} \tag{13.170}
$$

$$
\gamma_j \triangleq 1 - \alpha_j \Sigma_{jj} \tag{13.171}
$$

The quantity $\gamma_j$ is a measure of how well-determined $w_j$ is by the data (MacKay 1992). Hence 
$\gamma = \sum_{j=1}^D \gamma_j$ is the effective degrees of freedom of the model. See Section 7.5.3 for further 
discussion. 

Since $\boldsymbol{\alpha}$ and $\beta$ both depend on $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ (which can be computed using Equation 13.158 or the 
Laplace approximation), we need to re-estimate these equations until convergence. (Convergence 
properties of this algorithm have been studied in (Wipf and Nagarajan 2007).) At convergence, 
the results are formally identical to those obtained by EM, but since the objective is non-convex, 
the results can depend on the initial values. 
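
A minimal sketch of the fixed-point iteration for linear regression is given below, under the improper prior $a = b = c = d = 0$. It alternates between computing the posterior $\boldsymbol{\Sigma} = (\beta \mathbf{X}^T\mathbf{X} + \mathbf{A})^{-1}$, $\boldsymbol{\mu} = \beta \boldsymbol{\Sigma} \mathbf{X}^T \mathbf{y}$ and applying the updates $\alpha_j \leftarrow \gamma_j / \mu_j^2$, $\beta^{-1} \leftarrow \|\mathbf{y} - \mathbf{X}\boldsymbol{\mu}\|^2 / (N - \sum_j \gamma_j)$ with $\gamma_j = 1 - \alpha_j \Sigma_{jj}$. The cap `alpha_max`, the initial values, the iteration count, and the synthetic data are assumptions of this sketch, introduced only to keep the arithmetic finite when a precision diverges.

```python
import numpy as np

def ard_fixed_point(X, y, n_iter=200, alpha_max=1e8):
    """Type II ML for ARD linear regression with a = b = c = d = 0.

    alpha_max is a numerical cap standing in for alpha_j -> infinity.
    """
    N, D = X.shape
    alpha = np.ones(D)
    beta = 1.0
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(n_iter):
        Sigma = np.linalg.inv(beta * XtX + np.diag(alpha))   # posterior covariance
        mu = beta * (Sigma @ Xty)                            # posterior mean
        gamma = 1.0 - alpha * np.diag(Sigma)                 # well-determinedness
        alpha = np.minimum(gamma / np.maximum(mu ** 2, 1e-300), alpha_max)
        beta = (N - gamma.sum()) / np.sum((y - X @ mu) ** 2)
    # Posterior at the final hyperparameters
    Sigma = np.linalg.inv(beta * XtX + np.diag(alpha))
    mu = beta * (Sigma @ Xty)
    return mu, Sigma, alpha, beta

rng = np.random.default_rng(0)               # illustrative synthetic data
X = rng.standard_normal((100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -3.0, 1.5]                # only the first three features matter
y = X @ w_true + 0.1 * rng.standard_normal(100)

mu, Sigma, alpha, beta = ard_fixed_point(X, y)
print(np.round(mu, 3))
print(alpha)   # small for relevant features, very large for irrelevant ones
```

Features whose precision reaches the cap are effectively pruned: their posterior mean is numerically zero, while the relevant features keep a precision of order $1/w_j^2$.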


Iteratively reweighted $\ell_1$ algorithm 


Another approach to solving the ARD problem is based on the view that it is a MAP estimation 
problem. Although the log prior $g(\mathbf{w})$ is rather complex in form, it can be shown to be a 
non-decreasing, concave function of $|w_j|$. This means that it can be solved by an iteratively 
reweighted $\ell_1$ problem of the form 

$$
\mathbf{w}^{(t+1)} = \arg\min_{\mathbf{w}} \text{NLL}(\mathbf{w}) + \sum_j \lambda_j^{(t)} |w_j| \tag{13.172}
$$

In (Wipf and Nagarajan 2010), the following procedure for setting the penalty terms is suggested 
(based on a convex bound to the penalty function). We initialize with $\lambda_j^{(0)} = 1$, and then at 
iteration $t + 1$, compute $\lambda_j^{(t+1)}$ by iterating the following equation a few times:⁸ 

$$
\lambda_j^{(t+1)} \leftarrow \left[ \mathbf{X}_{:,j}^T \left( \sigma^2 \mathbf{I} + \mathbf{X} \, \text{diag}(1/\lambda_j^{(t)}) \, \text{diag}(|w_j^{(t+1)}|) \, \mathbf{X}^T \right)^{-1} \mathbf{X}_{:,j} \right]^{\frac{1}{2}} \tag{13.173}
$$

We see that the new penalty $\lambda_j$ depends on all the old weights. This is quite different from the 
adaptive lasso method of Section 13.6.2. 
To understand this difference, consider the noiseless case where $\sigma^2 = 0$, and assume $D \gg N$. 
In this case, there are $\binom{D}{N}$ solutions which perfectly reconstruct the data, $\mathbf{X}\mathbf{w} = \mathbf{y}$, and which 
have sparsity $\|\mathbf{w}\|_0 = N$; these are called basic feasible solutions or BFS. What we want are 
solutions that satisfy $\mathbf{X}\mathbf{w} = \mathbf{y}$ but which are much sparser than this. Suppose the method has 
found a BFS. We do not want to increase the penalty on a weight just because it is small (as 
in adaptive lasso), since that will just reinforce our current local optimum. Instead, we want to 
increase the penalty on a weight if it is small and if we have $\|\mathbf{w}^{(t+1)}\|_0 < N$. The covariance 
term $(\mathbf{X} \, \text{diag}(1/\lambda_j) \, \text{diag}(|w_j^{(t+1)}|) \, \mathbf{X}^T)^{-1}$ has this effect: if $\mathbf{w}$ is a BFS, this matrix will be full rank, 
so the penalty will not increase much, but if $\mathbf{w}$ is sparser than $N$, the matrix will not be full 
rank, so the penalties associated with zero-valued coefficients will increase, thus reinforcing this 
solution (Wipf and Nagarajan 2010). 


ARD for logistic regression 


Now consider binary logistic regression, $p(y | \mathbf{x}, \mathbf{w}) = \text{Ber}(y | \text{sigm}(\mathbf{w}^T \mathbf{x}))$, using the same 
Gaussian prior, $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mathbf{0}, \mathbf{A}^{-1})$. We can no longer use EM to estimate $\boldsymbol{\alpha}$, since the 
Gaussian prior is not conjugate to the logistic likelihood, so the E step cannot be done exactly. 
One approach is to use a variational approximation to the E step, as discussed in Section 21.8.1.1. 
A simpler approach is to use a Laplace approximation (see Section 8.4.1) in the E step. We can 
then use this approximation inside the same EM procedure as before, except we no longer need 
to update $\beta$. Note, however, that this is not guaranteed to converge. 

An alternative is to use the techniques from Section 13.7.4.3. In this case, we can use exact 
methods to compute the inner weighted $\ell_1$ regularized logistic regression problem, and no 
approximations are required. 


Sparse coding * 


So far, we have been concentrating on sparse priors for supervised learning. In this section, we 
discuss how to use them for unsupervised learning. 

In Section 12.6, we discussed ICA, which is like PCA except it uses a non-Gaussian prior 
for the latent factors $\mathbf{z}_i$. If we make the non-Gaussian prior be sparsity promoting, such as a 
Laplace distribution, we will be approximating each observed vector $\mathbf{x}_i$ as a sparse combination 
of basis vectors (columns of $\mathbf{W}$); note that the sparsity pattern (controlled by $\mathbf{z}_i$) changes from 
data case to data case. If we relax the constraint that $\mathbf{W}$ is orthogonal, we get a method called 
sparse coding. In this context, we call the factor loading matrix $\mathbf{W}$ a dictionary; each column 
is referred to as an atom.⁹ In view of the sparse representation, it is common for $L > D$, in 
which case we call the representation overcomplete. 


8. The algorithm in (Wipf and Nagarajan 2007) is equivalent to a single iteration of Equation 13.173. However, since the 
equation is cheap to compute (only $O(N D \|\mathbf{w}^{(t+1)}\|_0)$ time), it is worth iterating a few times before solving the more 
expensive $\ell_1$ problem. 


Method           $p(\mathbf{z}_i)$    $p(\mathbf{W})$    $\mathbf{W}$ orthogonal 
PCA              Gauss                -                  yes 
FA               Gauss                -                  no 
ICA              Non-Gauss            -                  yes 
Sparse coding    Laplace              -                  no 
Sparse PCA       Gauss                Laplace            maybe 
Sparse MF        Laplace              Laplace            no 

Table 13.3 Summary of various latent factor models. A dash “-” in the $p(\mathbf{W})$ column means we are 
performing ML parameter estimation rather than MAP parameter estimation. Summary of abbreviations: 
PCA = principal components analysis; FA = factor analysis; ICA = independent components analysis; MF = 
matrix factorization. 

In sparse coding, the dictionary can be fixed or learned. If it is fixed, it is common to use a 
wavelet or DCT basis, since many natural signals can be well approximated by a small number 
of such basis functions. However, it is also possible to learn the dictionary, by maximizing the 
likelihood 

$$
\log p(\mathcal{D} | \mathbf{W}) = \sum_{i=1}^N \log \int_{\mathbf{z}_i} \mathcal{N}(\mathbf{x}_i | \mathbf{W}\mathbf{z}_i, \sigma^2 \mathbf{I}) \, p(\mathbf{z}_i) \, d\mathbf{z}_i \tag{13.174}
$$

We discuss ways to optimize this below, and then we present several interesting applications. 
Do not confuse sparse coding with sparse PCA (see e.g., (Witten et al. 2009; Journee et al. 
2010)): this puts a sparsity promoting prior on the regression weights $\mathbf{W}$, whereas in sparse 
coding, we put a sparsity promoting prior on the latent factors $\mathbf{z}_i$. Of course, the two techniques 
can be combined; we call the result sparse matrix factorization, although this term is 
non-standard. See Table 13.3 for a summary of our terminology. 


Learning a sparse coding dictionary 


Since Equation 13.174 is a hard objective to maximize, it is common to make the following 
approximation: 

$$
\log p(\mathcal{D} | \mathbf{W}) \approx \sum_{i=1}^N \max_{\mathbf{z}_i} \left[ \log \mathcal{N}(\mathbf{x}_i | \mathbf{W}\mathbf{z}_i, \sigma^2 \mathbf{I}) + \log p(\mathbf{z}_i) \right] \tag{13.175}
$$

If $p(\mathbf{z}_i)$ is Laplace, we can rewrite the NLL as 

$$
\text{NLL}(\mathbf{W}, \mathbf{Z}) = \sum_{i=1}^N \frac{1}{2} \|\mathbf{x}_i - \mathbf{W}\mathbf{z}_i\|_2^2 + \lambda \|\mathbf{z}_i\|_1 \tag{13.176}
$$

9. It is common to denote the dictionary by $\mathbf{D}$, and to denote the latent factors by $\boldsymbol{\alpha}_i$. However, we will stick with the 
$\mathbf{W}$ and $\mathbf{z}_i$ notation. 

To prevent $\mathbf{W}$ from becoming arbitrarily large, it is common to constrain the $\ell_2$ norm of its 
columns to be less than or equal to 1. Let us denote this constraint set by 

$$
\mathcal{C} = \{ \mathbf{W} \in \mathbb{R}^{D \times L} \;\; \text{s.t.} \;\; \mathbf{w}_j^T \mathbf{w}_j \leq 1 \} \tag{13.177}
$$

Then we want to solve $\min_{\mathbf{W} \in \mathcal{C}, \mathbf{Z} \in \mathbb{R}^{L \times N}} \text{NLL}(\mathbf{W}, \mathbf{Z})$. For a fixed $\mathbf{Z}$, the optimization over 
$\mathbf{W}$ is a simple least squares problem. And for a fixed dictionary $\mathbf{W}$, the optimization problem 
over $\mathbf{Z}$ is identical to the lasso problem, for which many fast algorithms exist. This suggests 
an obvious iterative optimization scheme, in which we alternate between optimizing $\mathbf{W}$ and $\mathbf{Z}$. 
(Mumford 1994) called this kind of approach an analysis-synthesis loop, where estimating the 
basis $\mathbf{W}$ is the analysis phase, and estimating the coefficients $\mathbf{Z}$ is the synthesis phase. In cases 
where this is too slow, more sophisticated algorithms can be used, see e.g., (Mairal et al. 2010). 
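
The alternating scheme can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm of any cited paper: data cases are stored as the columns of `X` (so `X` is $D \times N$ and `Z` is $L \times N$), the lasso step over $\mathbf{Z}$ uses a few iterations of ISTA (proximal gradient), and the dictionary step uses projected gradient onto the unit-norm constraint set instead of an exact constrained least squares solve. With step sizes equal to the inverse Lipschitz constants, neither step can increase the NLL of Equation 13.176.

```python
import numpy as np

def soft(A, t):
    """Elementwise soft thresholding."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def nll(X, W, Z, lam):
    """sum_i 0.5*||x_i - W z_i||^2 + lam*||z_i||_1, with data cases as columns."""
    return 0.5 * np.sum((X - W @ Z) ** 2) + lam * np.sum(np.abs(Z))

def project_columns(W):
    """Project each column of W onto the unit l2 ball (the constraint set C)."""
    return W / np.maximum(np.linalg.norm(W, axis=0), 1.0)

def sparse_coding(X, L, lam=0.1, n_outer=30, n_inner=20, seed=0):
    rng = np.random.default_rng(seed)
    D, N = X.shape
    W = project_columns(rng.standard_normal((D, L)))
    Z = np.zeros((L, N))
    history = [nll(X, W, Z, lam)]
    for _ in range(n_outer):
        # Synthesis phase: lasso over Z with W fixed (ISTA, step 1/Lz)
        Lz = max(np.linalg.norm(W, 2) ** 2, 1e-12)
        for _ in range(n_inner):
            Z = soft(Z - W.T @ (W @ Z - X) / Lz, lam / Lz)
        # Analysis phase: constrained least squares over W (projected gradient)
        Lw = max(np.linalg.norm(Z, 2) ** 2, 1e-12)
        for _ in range(n_inner):
            W = project_columns(W - (W @ Z - X) @ Z.T / Lw)
        history.append(nll(X, W, Z, lam))
    return W, Z, history

rng = np.random.default_rng(1)              # illustrative synthetic data
X = rng.standard_normal((8, 50))
W, Z, history = sparse_coding(X, L=12)      # overcomplete: L > D
print(history[0], history[-1])
print(np.mean(Z == 0))                      # fraction of exactly-zero codes
```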

A variety of other models result in an optimization problem that is similar to Equation 13.176. 
For example, non-negative matrix factorization or NMF (Paatero and Tapper 1994; Lee and 
Seung 2001) requires solving an objective of the form 

$$
\min_{\mathbf{W} \in \mathcal{C}, \mathbf{Z} \in \mathbb{R}^{L \times N}} \frac{1}{2} \sum_{i=1}^N \|\mathbf{x}_i - \mathbf{W}\mathbf{z}_i\|_2^2 \quad \text{s.t.} \quad \mathbf{W} \geq 0, \; \mathbf{z}_i \geq 0 \tag{13.178}
$$

(Note that this has no hyper-parameters to tune.) The intuition behind this constraint is that the 
learned dictionary may be more interpretable if it is a positive sum of positive “parts”, rather 
than a sparse sum of atoms that may be positive or negative. Of course, we can combine NMF 
with a sparsity promoting prior on the latent factors. This is called non-negative sparse coding 
(Hoyer 2004). 
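
One common way to fit the NMF objective is with multiplicative updates, usually attributed to Lee and Seung. The sketch below applies them to the squared-error objective; it is a simplified illustration that drops the column-norm constraint $\mathbf{W} \in \mathcal{C}$ and assumes non-negative data stored as the columns of `X`. The small constant `eps` and the random initialization are assumptions added to avoid division by zero.

```python
import numpy as np

def nmf(X, L, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative updates for min 0.5*||X - W Z||_F^2 s.t. W >= 0, Z >= 0."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    W = rng.random((D, L)) + 0.1           # strictly positive initialization
    Z = rng.random((L, N)) + 0.1
    errors = []
    for _ in range(n_iter):
        Z *= (W.T @ X) / (W.T @ W @ Z + eps)    # update codes, stays non-negative
        W *= (X @ Z.T) / (W @ Z @ Z.T + eps)    # update dictionary, stays non-negative
        errors.append(0.5 * np.sum((X - W @ Z) ** 2))
    return W, Z, errors

rng = np.random.default_rng(1)             # illustrative non-negative data
X = rng.random((10, 40))
W, Z, errors = nmf(X, L=4)
print(errors[0], errors[-1])
```

Because each update multiplies the current value by a non-negative ratio, the factors never leave the non-negative orthant, and the reconstruction error is non-increasing across iterations.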

Alternatively, we can drop the positivity constraint, but impose a sparsity constraint on both 
the factors $\mathbf{z}_i$ and the dictionary $\mathbf{W}$. We call this sparse matrix factorization. To ensure strict 
convexity, we can use an elastic net type penalty on the weights (Mairal et al. 2010) resulting in 

$$
\min_{\mathbf{W}, \mathbf{Z}} \frac{1}{2} \sum_{i=1}^N \|\mathbf{x}_i - \mathbf{W}\mathbf{z}_i\|_2^2 + \lambda \|\mathbf{z}_i\|_1 \quad \text{s.t.} \quad \|\mathbf{w}_j\|_2^2 + \gamma \|\mathbf{w}_j\|_1 \leq 1 \tag{13.179}
$$

There are several related objectives one can write down. For example, we can replace the lasso 
NLL with group lasso or fused lasso (Witten et al. 2009). 

We can also use other sparsity-promoting priors besides the Laplace. For example, (Zhou et al. 
2009) propose a model in which the latent factors $\mathbf{z}_i$ are made sparse using the binary mask 
model of Section 13.2.2. Each bit of the mask can be generated from a Bernoulli distribution 
with parameter $\pi$, which can be drawn from a beta distribution. Alternatively, we can use a 
non-parametric prior, such as the beta process. This allows the model to use dictionaries of 
unbounded size, rather than having to specify L in advance. One can perform Bayesian inference 
in this model using e.g., Gibbs sampling or variational Bayes. One finds that the effective size 
of the dictionary goes down as the noise level goes up, due to the Bayesian Occam’s razor. This 
can prevent overfitting. See (Zhou et al. 2009) for details. 


13.8.2 Results of dictionary learning from image patches 


One reason that sparse coding has generated so much interest recently is that it explains an 
interesting phenomenon in neuroscience. In particular, the dictionary that is learned by applying 
sparse coding to patches of natural images consists of basis vectors that look like the filters that 
are found in simple cells in the primary visual cortex of the mammalian brain (Olshausen and 
Field 1996). Specifically, the filters look like bar and edge detectors, as shown in Figure 13.21(b). 
(In this example, the parameter λ was chosen so that the number of active basis functions 
(non-zero components of z_i) is about 10.) Interestingly, using ICA gives visually similar results, 
as shown in Figure 13.21(a). By contrast, applying PCA to the same data results in sinusoidal 
gratings, as shown in Figure 13.21(c); these do not look like cortical cell response patterns (see 
footnote 10). It has therefore been conjectured that parts of the cortex may be performing sparse 
coding of the sensory input; the resulting latent representation is then further processed by 
higher levels of the brain. 

Figure 13.21 Illustration of the filters learned by various methods when applied to natural image patches. 
(Each patch is first centered and normalized to unit norm.) (a) ICA. Figure generated by icaBasisDemo, 
kindly provided by Aapo Hyvarinen. (b) sparse coding. (c) PCA. (d) non-negative matrix factorization. (e) 
sparse PCA with low sparsity on weight matrix. (f) sparse PCA with high sparsity on weight matrix. Figure 
generated by sparseDictDemo, written by Julien Mairal. 

Figure 13.21(d) shows the result of using NMF, and Figure 13.21(e-f) show the results of sparse 
PCA, as we increase the sparsity of the basis vectors. 


13.8.3 Compressed sensing 


Although it is interesting to look at the dictionaries learned by sparse coding, it is not necessarily 
very useful. However, there are some practical applications of sparse coding, which we discuss 
below. 

Imagine that, instead of observing the data x ∈ R^D, we observe a low-dimensional projection 
of it, y = Rx + ε, where y ∈ R^M, R is an M × D matrix, M < D, and ε is a noise term 
(usually Gaussian). We assume R is a known sensing matrix, corresponding to different linear 
projections of x. For example, consider an MRI scanner: each beam direction corresponds to a 
vector, encoded as a row in R. Figure 13.22 illustrates the modeling assumptions. 

Our goal is to infer p(x|y, R). How can we hope to recover all of x if we do not measure 
all of x? The answer is: we can use Bayesian inference with an appropriate prior that exploits 
the fact that natural signals can be expressed as a weighted combination of a small number of 
suitably chosen basis functions. That is, we assume x = Wz, where z has a sparse prior, and 
W is a suitable dictionary. This is called compressed sensing or compressive sensing (Candes 
et al. 2006; Baraniuk 2007; Candes and Wakin 2008; Bruckstein et al. 2009). 

For CS to work, it is important to represent the signal in the right basis, otherwise it will 
not be sparse. In traditional CS applications, the dictionary is fixed to be a standard form, 
such as wavelets. However, one can get much better performance by learning a domain-specific 
dictionary using sparse coding (Zhou et al. 2009). As for the sensing matrix R, it is often chosen 
to be a random matrix, for reasons explained in (Candes and Wakin 2008). However, one can 
get better performance by adapting the projection matrix to the dictionary (Seeger and Nickisch 
2008; Chang et al. 2009). 
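
The recovery problem described above can be sketched numerically as follows. This is only an illustration under simplifying assumptions: it computes a MAP point estimate with a Laplace prior (an ℓ1 penalty) rather than the full posterior p(x|y, R), it uses iterative soft thresholding (ISTA) as the solver, and it takes the dictionary W to be the identity. The sizes, the noise level, and the value of lam are arbitrary choices.

```python
import numpy as np

def ista(A, y, lam, n_iters=3000):
    """Minimize 0.5*||y - A z||^2 + lam*||z||_1 by iterative soft thresholding."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant of the gradient
    z = np.zeros(A.shape[1])
    for _ in range(n_iters):
        v = z - step * (A.T @ (A @ z - y))       # gradient step on the quadratic term
        z = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)  # soft threshold
    return z

rng = np.random.default_rng(0)
D, M = 100, 40
W = np.eye(D)                                    # toy dictionary: x is sparse in the standard basis
z_true = np.zeros(D)
z_true[[7, 42, 85]] = [1.5, -2.0, 1.0]           # only 3 active coefficients
x = W @ z_true

R = rng.standard_normal((M, D)) / np.sqrt(M)     # random sensing matrix
y = R @ x + 0.01 * rng.standard_normal(M)        # M < D noisy measurements

z_hat = ista(R @ W, y, lam=0.05)
x_hat = W @ z_hat
rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
print("largest recovered coefficients at:", np.sort(np.argsort(-np.abs(z_hat))[:3]))
print(f"relative reconstruction error: {rel_err:.3f}")
```

The effective design matrix is the product RW, which is why the choice of dictionary and of sensing matrix interact.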


Figure 13.22 Schematic DGM for compressed sensing. We observe a low dimensional measurement y 
generated by passing x through a measurement matrix R, and possibly subject to observation noise with 
variance σ². We assume that x has a sparse decomposition in terms of the dictionary W and the latent 
variables z. The parameter λ controls the sparsity level. 

10. The reason PCA discovers sinusoidal grating patterns is that it is trying to model the covariance of the data, which, 
in the case of image patches, is translation invariant. This means cov[I(x, y), I(x', y')] = f[(x - x')² + (y - y')²] 
for some function f, where I(x, y) is the image intensity at location (x, y). One can show (Hyvarinen et al. 2009, p125) 
that the eigenvectors of a matrix of this kind are always sinusoids of different phases, i.e., PCA discovers a Fourier basis. 


13.8.4 Image inpainting and denoising 


Suppose we have an image which is corrupted in some way, e.g., by having text or scratches 
sparsely superimposed on top of it, as in Figure 13.23. We might want to estimate the underlying 
“clean” image. This is called image inpainting. One can use similar techniques for image 
denoising. 

Figure 13.23 An example of image inpainting using sparse coding. Left: original image. Right: 
reconstruction. Source: Figure 13 of (Mairal et al. 2008). Used with kind permission of Julien Mairal. 

We can model this as a special kind of compressed sensing problem. The basic idea is as 
follows. We partition the image into overlapping patches, y_i, and concatenate them to form y. 
We define R so that the i'th row selects out patch i. Now define V to be the visible (uncorrupted) 
components of y, and H to be the hidden components. To perform image inpainting, we just 
compute p(y_H | y_V, θ), where θ are the model parameters, which specify the dictionary W and 
the sparsity level λ of z. We can either learn a dictionary offline from a database of images, or 
we can learn a dictionary just for this image, based on the non-corrupted patches. 

Figure 13.23 shows this technique in action. The dictionary (of size 256 atoms) was learned 
from 7 × 10^6 undamaged 12 × 12 color patches in the 12 mega-pixel image. 

An alternative approach is to use a graphical model (e.g., the fields of experts model (S. 
and Black 2009)) which directly encodes correlations between neighboring image patches, rather 
than using a latent variable model. Unfortunately, such models tend to be computationally more 
expensive. 


Exercises 


Exercise 13.1 Partial derivative of the RSS 


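Define 

$$\mathrm{RSS}(\mathbf{w}) = \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2 \tag{13.180}$$

a. Show that 

$$\frac{\partial}{\partial w_k} \mathrm{RSS}(\mathbf{w}) = a_k w_k - c_k \tag{13.181}$$
$$a_k = 2 \sum_{i=1}^{N} x_{ik}^2 = 2 \|\mathbf{x}_{:,k}\|^2 \tag{13.182}$$
$$c_k = 2 \sum_{i=1}^{N} x_{ik} (y_i - \mathbf{w}_{-k}^T \mathbf{x}_{i,-k}) = 2\, \mathbf{x}_{:,k}^T \mathbf{r}_k \tag{13.183}$$

where w_{-k} is w without component k, x_{i,-k} is x_i without component k, and r_k = y - X_{:,-k} w_{-k} 
is the residual due to using all the features except feature k. Hint: Partition the weights into those 
involving k and those not involving k. 

b. Show that if ∂RSS(w)/∂w_k = 0, then 

$$\hat{w}_k = \frac{\mathbf{x}_{:,k}^T \mathbf{r}_k}{\|\mathbf{x}_{:,k}\|^2} \tag{13.184}$$

Hence when we sequentially add features, the optimal weight for feature k is computed by 
orthogonally projecting x_{:,k} onto the current residual. 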


Exercise 13.2 Derivation of M step for EB for linear regression 
Derive Equations 13.166 and 13.168. Hint: the following identity should be useful 


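$$\boldsymbol{\Sigma}\mathbf{X}^T\mathbf{X} = \boldsymbol{\Sigma}\mathbf{X}^T\mathbf{X} + \beta^{-1}\boldsymbol{\Sigma}\mathbf{A} - \beta^{-1}\boldsymbol{\Sigma}\mathbf{A} \tag{13.185}$$
$$= \boldsymbol{\Sigma}(\mathbf{X}^T\mathbf{X}\beta + \mathbf{A})\beta^{-1} - \beta^{-1}\boldsymbol{\Sigma}\mathbf{A} \tag{13.186}$$
$$= (\mathbf{A} + \beta\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{X}^T\mathbf{X}\beta + \mathbf{A})\beta^{-1} - \beta^{-1}\boldsymbol{\Sigma}\mathbf{A} \tag{13.187}$$
$$= (\mathbf{I} - \boldsymbol{\Sigma}\mathbf{A})\beta^{-1} \tag{13.188}$$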


Exercise 13.3 Derivation of fixed point updates for EB for linear regression 


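Derive Equations 13.169 and 13.170. Hint: The easiest way to derive this result is to rewrite log p(D|α, β) 
as in Equation 8.54. This is exactly equivalent, since in the case of a Gaussian prior and likelihood, the 
posterior is also Gaussian, so the Laplace “approximation” is exact. In this case, we get 

$$\log p(\mathcal{D}|\boldsymbol{\alpha}, \beta) = \frac{N}{2} \log \beta - \frac{\beta}{2} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \frac{1}{2} \sum_j \log \alpha_j - \frac{1}{2} \mathbf{m}^T \mathbf{A} \mathbf{m} + \frac{1}{2} \log |\boldsymbol{\Sigma}| - \frac{D}{2} \log(2\pi) \tag{13.189}$$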


The rest is straightforward algebra. 
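Exercise 13.4 Marginal likelihood for linear regression 


Suppose we use a g-prior of the form Σ_γ = g(X_γ^T X_γ)^{-1}. Show that Equation 13.16 simplifies to 


$$p(\mathcal{D}|\boldsymbol{\gamma}) \propto (1+g)^{-D_\gamma/2} \, (2 b_\sigma + S(\boldsymbol{\gamma}))^{-(2 a_\sigma + N - 1)/2} \tag{13.190}$$
$$S(\boldsymbol{\gamma}) = \mathbf{y}^T\mathbf{y} - \frac{g}{1+g}\, \mathbf{y}^T \mathbf{X}_\gamma (\mathbf{X}_\gamma^T \mathbf{X}_\gamma)^{-1} \mathbf{X}_\gamma^T \mathbf{y} \tag{13.191}$$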


Exercise 13.5 Reducing elastic net to lasso 


Define 

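$$J_1(\mathbf{w}) = |\mathbf{y} - \mathbf{X}\mathbf{w}|^2 + \lambda_2 |\mathbf{w}|^2 + \lambda_1 |\mathbf{w}|_1 \tag{13.192}$$

and 

$$J_2(\mathbf{w}) = |\tilde{\mathbf{y}} - \tilde{\mathbf{X}}\mathbf{w}|^2 + c \lambda_1 |\mathbf{w}|_1 \tag{13.193}$$

where c = (1 + λ_2)^{-1/2} and 

$$\tilde{\mathbf{X}} = c \begin{pmatrix} \mathbf{X} \\ \sqrt{\lambda_2}\, \mathbf{I}_d \end{pmatrix}, \quad \tilde{\mathbf{y}} = \begin{pmatrix} \mathbf{y} \\ \mathbf{0}_{d \times 1} \end{pmatrix} \tag{13.194}$$

Show 

$$\arg\min J_1(\mathbf{w}) = c\,(\arg\min J_2(\mathbf{w})) \tag{13.195}$$

i.e., 

$$J_1(c\mathbf{w}) = J_2(\mathbf{w}) \tag{13.196}$$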
and hence that one can solve an elastic net problem using a lasso solver on modified data. 


Exercise 13.6 Shrinkage in linear regression 
(Source: Jaakkola.) Consider performing linear regression with an orthonormal design matrix, so 
||x_{:,k}||_2^2 = 1 for each column (feature) k, and x_{:,k}^T x_{:,j} = 0, so we can estimate each 
parameter w_k separately. 

Figure 13.24 plots ŵ_k vs c_k = 2 y^T x_{:,k}, the correlation of feature k with the response, for 3 different 
estimation methods: ordinary least squares (OLS), ridge regression with parameter λ_2, and lasso with 
parameter λ_1. 


a. Unfortunately we forgot to label the plots. Which method does the solid (1), dotted (2) and dashed (3) 
line correspond to? Hint: see Section 13.3.3. 


b. What is the value of λ_1? 
c. What is the value of λ_2? 


Exercise 13.7 Prior for the Bernoulli rate parameter in the spike and slab model 


Consider the model in Section 13.2.1. Suppose we put a prior on the sparsity rates, π_j ~ Beta(α_1, α_2). 
Derive an expression for p(γ|α) after integrating out the π_j’s. Discuss some advantages and disadvantages 
of this approach compared to assuming π_j = π_0 for fixed π_0. 

Figure 13.24 Plot of ŵ_k vs amount of correlation c_k for three different estimators. 


Exercise 13.8 Deriving E step for GSM prior 
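Show that 

$$\mathbb{E}\left[ \frac{1}{\tau_j^2} \,\Big|\, w_j \right] = \frac{\pi'(w_j)}{|w_j|} \tag{13.197}$$

where π(w_j) = -log p(w_j) and p(w_j) = ∫ N(w_j|0, τ_j²) p(τ_j²) dτ_j². Hint 1: 

$$\frac{1}{\tau_j^2} \mathcal{N}(w_j|0, \tau_j^2) \propto \frac{1}{\tau_j^2} \exp\left( -\frac{w_j^2}{2\tau_j^2} \right) \tag{13.198}$$
$$= \frac{-1}{|w_j|} \, \frac{-2|w_j|}{2\tau_j^2} \exp\left( -\frac{w_j^2}{2\tau_j^2} \right) \tag{13.199}$$
$$= \frac{-1}{|w_j|} \, \frac{d}{d|w_j|} \mathcal{N}(w_j|0, \tau_j^2) \tag{13.200}$$

Hint 2: 

$$\frac{1}{p(w_j)} \frac{d}{d|w_j|} p(w_j) = \frac{d}{d|w_j|} \log p(w_j) \tag{13.201}$$


Exercise 13.9 EM for sparse probit regression with Laplace prior 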


Derive an EM algorithm for fitting a binary probit classifier (Section 9.4) using a Laplace prior on the 


weights. (If you get stuck, see (Figueiredo 2003; Ding and Harrison 2010).) 


Exercise 13.10 GSM representation of group lasso 


Consider the prior τ_j² ~ Ga(δ, ρ²/2), ignoring the grouping issue for now. The marginal distribution 
induced on the weights by a Gamma mixing distribution is called the normal Gamma distribution and is 
given by 

$$\mathrm{NG}(w_j|\delta, \rho) = \int \mathcal{N}(w_j|0, \tau_j^2)\, \mathrm{Ga}(\tau_j^2|\delta, \rho^2/2)\, d\tau_j^2 \tag{13.202}$$
$$= \frac{1}{Z} |w_j|^{\delta - 1/2} K_{\delta - 1/2}(\rho |w_j|) \tag{13.203}$$
$$\frac{1}{Z} = \frac{\rho^{\delta + 1/2}}{\sqrt{\pi}\, 2^{\delta - 1/2}\, \Gamma(\delta)} \tag{13.204}$$

where K_α(x) is the modified Bessel function of the second kind (the besselk function in Matlab). 


Now suppose we have the following prior on the variances 

$$p(\tau^2_{1:D}) = \prod_{g=1}^{G} p(\tau^2_{1:d_g}), \quad p(\tau^2_{1:d_g}) = \prod_{j \in g} \mathrm{Ga}(\tau_j^2|\delta_g, \rho^2/2) \tag{13.205}$$

The corresponding marginal for each group of weights has the form 

$$p(\mathbf{w}_g) \propto |u_g|^{\delta_g - d_g/2} K_{\delta_g - d_g/2}(\rho u_g) \tag{13.206}$$

where 

$$u_g \triangleq \sqrt{\sum_{j \in g} w_{g,j}^2} = \|\mathbf{w}_g\|_2 \tag{13.207}$$

Now suppose δ_g = (d_g + 1)/2, so δ_g - d_g/2 = 1/2. Conveniently, we have K_{1/2}(z) = sqrt(π/(2z)) exp(-z). Show 
that the resulting MAP estimate is equivalent to group lasso. 


Exercise 13.11 Projected gradient descent for ℓ1 regularized least squares 


Consider the BPDN problem argmin_θ RSS(θ) + λ||θ||_1. By using the split variable trick introduced in 
Section 7.4 (i.e., by defining θ = [θ_+, θ_-]), rewrite this as a quadratic program with a simple bound 
constraint. Then sketch how to use projected gradient descent to solve this problem. (If you get stuck, 
consult (Figueiredo et al. 2007).) 


Exercise 13.12 Subderivative of the hinge loss function 

Let f(x) = (1 - x)_+ be the hinge loss function, where (z)_+ = max(0, z). What are ∂f(0), ∂f(1), and 
∂f(2)? 

Exercise 13.13 Lower bounds to convex functions 


Let f be a convex function. Explain how to find a global affine lower bound to f at an arbitrary point 
x ∈ dom(f). 


Kernels 


14.1 Introduction 


So far in this book, we have been assuming that each object that we wish to classify or cluster 
or process in any way can be represented as a fixed-size feature vector, typically of the form 
x_i ∈ R^D. However, for certain kinds of objects, it is not clear how to best represent them 
as fixed-sized feature vectors. For example, how do we represent a text document or protein 
sequence, which can be of variable length? or a molecular structure, which has complex 3d 
geometry? or an evolutionary tree, which has variable size and shape? 

One approach to such problems is to define a generative model for the data, and use the 
inferred latent representation and/or the parameters of the model as features, and then to plug 
these features in to standard methods. For example, in Chapter 28, we discuss deep learning, 
which is essentially an unsupervised way to learn good feature representations. 

Another approach is to assume that we have some way of measuring the similarity between 
objects, that doesn’t require preprocessing them into feature vector format. For example, when 
comparing strings, we can compute the edit distance between them. Let κ(x, x') ≥ 0 be some 
measure of similarity between objects x, x' ∈ X, where X is some abstract space; we will call κ 
a kernel function. Note that the word “kernel” has several meanings; we will discuss a different 
interpretation in Section 14.7.1. 

In this chapter, we will discuss several kinds of kernel functions. We then describe some 
algorithms that can be written purely in terms of kernel function computations. Such methods 
can be used when we don’t have access to (or choose not to look at) the “inside” of the objects 
x that we are processing. 


14.2 Kernel functions 


We define a kernel function to be a real-valued function of two arguments, κ(x, x') ∈ R, for 
x, x' ∈ X. Typically the function is symmetric (i.e., κ(x, x') = κ(x', x)), and non-negative (i.e., 
κ(x, x') ≥ 0), so it can be interpreted as a measure of similarity, but this is not required. We 
give several examples below. 


14.2.1 RBF kernels 


The squared exponential kernel (SE kernel) or Gaussian kernel is defined by 


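$$\kappa(\mathbf{x}, \mathbf{x}') = \exp\left( -\frac{1}{2} (\mathbf{x} - \mathbf{x}')^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \mathbf{x}') \right) \tag{14.1}$$

If Σ is diagonal, this can be written as 

$$\kappa(\mathbf{x}, \mathbf{x}') = \exp\left( -\frac{1}{2} \sum_{j=1}^{D} \frac{1}{\sigma_j^2} (x_j - x_j')^2 \right) \tag{14.2}$$

We can interpret the σ_j as defining the characteristic length scale of dimension j. If σ_j = ∞, 
the corresponding dimension is ignored; hence this is known as the ARD kernel. If Σ is 
spherical, we get the isotropic kernel 

$$\kappa(\mathbf{x}, \mathbf{x}') = \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2} \right) \tag{14.3}$$

Here σ² is known as the bandwidth. Equation 14.3 is an example of a radial basis function 
or RBF kernel, since it is only a function of ||x - x'||. 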
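
The diagonal (ARD) and isotropic cases can be sketched in a few lines of NumPy. The function name and the random test inputs are illustrative assumptions. The example also checks the ARD property numerically: giving one dimension a very large length scale yields nearly the same Gram matrix as dropping that dimension.

```python
import numpy as np

def se_kernel(X1, X2, length_scales):
    """ARD squared exponential kernel (Eq. 14.2) between the rows of X1 and X2.

    length_scales holds one sigma_j per input dimension; using the same value
    for every dimension gives the isotropic kernel of Eq. 14.3.
    """
    A = X1 / length_scales             # rescale each dimension by its length scale
    B = X2 / length_scales
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0))   # clip tiny negatives from round-off

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
K_iso = se_kernel(X, X, np.array([1.0, 1.0]))
K_ard = se_kernel(X, X, np.array([1.0, 1e6]))          # huge sigma_2: dimension 2 barely matters
K_dim1 = se_kernel(X[:, :1], X[:, :1], np.array([1.0]))
print(np.allclose(K_ard, K_dim1, atol=1e-6))
```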


14.2.2 Kernels for comparing documents 


When performing document classification or retrieval, it is useful to have a way of comparing 
two documents, x_i and x_{i'}. If we use a bag of words representation, where x_{ij} is the number 
of times word j occurs in document i, we can use the cosine similarity, which is defined by 

$$\kappa(\mathbf{x}_i, \mathbf{x}_{i'}) = \frac{\mathbf{x}_i^T \mathbf{x}_{i'}}{\|\mathbf{x}_i\|_2 \, \|\mathbf{x}_{i'}\|_2} \tag{14.4}$$

This quantity measures the cosine of the angle between x_i and x_{i'} when interpreted as vectors. 
Since x_i is a count vector (and hence non-negative), the cosine similarity is between 0 and 1, 
where 0 means the vectors are orthogonal and therefore have no words in common. 

Unfortunately, this simple method does not work very well, for two main reasons. First, if x_i 
has any word in common with x_{i'}, it is deemed similar, even though some popular words, such 
as “the” or “and” occur in many documents, and are therefore not discriminative. (These are 
known as stop words.) Second, if a discriminative word occurs many times in a document, the 
similarity is artificially boosted, even though word usage tends to be bursty, meaning that once 
a word is used in a document it is very likely to be used again (see Section 3.5.5). 

Fortunately, we can significantly improve performance using some simple preprocessing. The 
idea is to replace the word count vector with a new feature vector called the TF-IDF representa- 
tion, which stands for “term frequency inverse document frequency”. We define this as follows. 
First, the term frequency is defined as a log-transform of the count: 


$$\mathrm{tf}(x_{ij}) \triangleq \log(1 + x_{ij}) \tag{14.5}$$

This reduces the impact of words that occur many times within one document. Second, the 
inverse document frequency is defined as 

$$\mathrm{idf}(j) \triangleq \log \frac{N}{1 + \sum_{i=1}^{N} \mathbb{I}(x_{ij} > 0)} \tag{14.6}$$

where N is the total number of documents, and the denominator counts how many documents 
contain term j. Finally, we define 

$$\text{tf-idf}(\mathbf{x}_i) \triangleq [\mathrm{tf}(x_{ij}) \times \mathrm{idf}(j)]_{j=1}^{V} \tag{14.7}$$

(There are several other ways to define the tf and idf terms, see (Manning et al. 2008) for details.) 
We then use this inside the cosine similarity measure. That is, our new kernel has the form 

$$\kappa(\mathbf{x}_i, \mathbf{x}_{i'}) = \frac{\boldsymbol{\phi}(\mathbf{x}_i)^T \boldsymbol{\phi}(\mathbf{x}_{i'})}{\|\boldsymbol{\phi}(\mathbf{x}_i)\|_2 \, \|\boldsymbol{\phi}(\mathbf{x}_{i'})\|_2} \tag{14.8}$$

where φ(x) = tf-idf(x). This gives good results for information retrieval (Manning et al. 2008). 
A probabilistic interpretation of the tf-idf kernel is given in (Elkan 2005). 
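
The effect of the TF-IDF transform on the cosine similarity can be illustrated with a toy corpus. The following is a minimal sketch following Equations 14.5 to 14.8; the count matrix is hypothetical and the function names are my own.

```python
import numpy as np

def tfidf(counts):
    """Map an N x V matrix of word counts to TF-IDF features (Eqs. 14.5 to 14.7)."""
    N = counts.shape[0]
    tf = np.log1p(counts)                          # log(1 + x_ij)
    doc_freq = np.sum(counts > 0, axis=0)          # number of documents containing each term
    idf = np.log(N / (1.0 + doc_freq))
    return tf * idf

def cosine_kernel(F):
    """Cosine similarity between all pairs of rows of F (Eq. 14.8)."""
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    G = F / norms                                  # assumes no row is all zeros
    return G @ G.T

# Hypothetical corpus: 4 documents over a 5-word vocabulary.
# Column 0 plays the role of a stop word that occurs in 3 of the 4 documents.
counts = np.array([
    [5, 2, 0, 0, 0],
    [4, 3, 0, 0, 0],
    [6, 0, 0, 2, 1],
    [0, 0, 3, 0, 1],
])
K_raw = cosine_kernel(counts.astype(float))
K_tfidf = cosine_kernel(tfidf(counts))
print(np.round(K_raw, 2))
print(np.round(K_tfidf, 2))
```

Documents 0 and 2 share only the stop word. On raw counts they look similar, but with this definition of idf the stop word receives weight log(4/4) = 0, so their TF-IDF similarity vanishes.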


14.2.3 Mercer (positive definite) kernels 


Some methods that we will study require that the kernel function satisfy the requirement that 
the Gram matrix, defined by 

$$\mathbf{K} = \begin{pmatrix} \kappa(\mathbf{x}_1, \mathbf{x}_1) & \cdots & \kappa(\mathbf{x}_1, \mathbf{x}_N) \\ & \vdots & \\ \kappa(\mathbf{x}_N, \mathbf{x}_1) & \cdots & \kappa(\mathbf{x}_N, \mathbf{x}_N) \end{pmatrix} \tag{14.9}$$

be positive definite for any set of inputs {x_i}_{i=1}^N. We call such a kernel a Mercer kernel, or 
positive definite kernel. It can be shown (Schoelkopf and Smola 2002) that the Gaussian kernel 
is a Mercer kernel as is the cosine similarity kernel (Sahami and Heilman 2006). 


The importance of Mercer kernels is the following result, known as Mercer’s theorem. If the 
Gram matrix is positive definite, we can compute an eigenvector decomposition of it as follows 

$$\mathbf{K} = \mathbf{U}^T \boldsymbol{\Lambda} \mathbf{U} \tag{14.10}$$

where Λ is a diagonal matrix of eigenvalues λ_i > 0. Now consider an element of K: 

$$k_{ij} = (\boldsymbol{\Lambda}^{1/2} \mathbf{U}_{:,i})^T (\boldsymbol{\Lambda}^{1/2} \mathbf{U}_{:,j}) \tag{14.11}$$

Let us define φ(x_i) = Λ^{1/2} U_{:,i}. Then we can write 

$$k_{ij} = \boldsymbol{\phi}(\mathbf{x}_i)^T \boldsymbol{\phi}(\mathbf{x}_j) \tag{14.12}$$

Thus we see that the entries in the kernel matrix can be computed by performing an inner 
product of some feature vectors that are implicitly defined by the eigenvectors U. In general, if 
the kernel is Mercer, then there exists a function φ mapping x ∈ X to R^D such that 

$$\kappa(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}') \tag{14.13}$$

where φ depends on the eigen functions of κ (so D is a potentially infinite dimensional space). 
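
The finite-sample construction of Equations 14.10 to 14.12 can be checked numerically. This is a minimal sketch, assuming a small hypothetical data set and the isotropic Gaussian kernel with σ² = 1. Note that `numpy.linalg.eigh` returns the decomposition in the form K = V diag(λ) Vᵀ, so the matrix U of Equation 14.10 corresponds to Vᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))                 # 6 hypothetical inputs in 2d

# Gram matrix of the isotropic Gaussian kernel (Eq. 14.3) with sigma^2 = 1.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

evals, V = np.linalg.eigh(K)                    # K = V diag(evals) V^T
U = V.T                                         # so that K = U^T Lambda U
# Column i of Phi is phi(x_i) = Lambda^{1/2} U[:, i].
Phi = np.sqrt(np.maximum(evals, 0.0))[:, None] * U

print(np.allclose(Phi.T @ Phi, K))              # inner products of the features reproduce K
print("smallest eigenvalue:", evals.min())
```

These features are defined only at the N training inputs; the general statement in Equation 14.13 concerns a map defined on all of X.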

For example, consider the (non-stationary) polynomial kernel κ(x, x') = (γ x^T x' + r)^M, 
where r > 0. One can show that the corresponding feature vector φ(x) will contain all terms 
up to degree M. For example, if M = 2, γ = r = 1 and x, x' ∈ R², we have 

$$(1 + \mathbf{x}^T \mathbf{x}')^2 = (1 + x_1 x_1' + x_2 x_2')^2 \tag{14.14}$$
$$= 1 + 2 x_1 x_1' + 2 x_2 x_2' + (x_1 x_1')^2 + (x_2 x_2')^2 + 2 x_1 x_1' x_2 x_2' \tag{14.15}$$

This can be written as φ(x)^T φ(x'), where 

$$\boldsymbol{\phi}(\mathbf{x}) = [1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, x_2^2, \sqrt{2} x_1 x_2]^T \tag{14.16}$$

So using this kernel is equivalent to working in a 6 dimensional feature space. In the case of 
a Gaussian kernel, the feature map lives in an infinite dimensional space. In such a case, it is 
clearly infeasible to explicitly represent the feature vectors. 
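
The equivalence between the degree-2 polynomial kernel and the explicit 6 dimensional feature map can be verified directly. A minimal sketch, with arbitrarily chosen test vectors:

```python
import numpy as np

def phi(x):
    """Explicit feature map of Eq. 14.16 for the degree-2 polynomial kernel in 2d."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

def poly_kernel(x, xp, gamma=1.0, r=1.0, M=2):
    """Polynomial kernel (gamma * x^T x' + r)^M."""
    return (gamma * np.dot(x, xp) + r) ** M

x = np.array([0.5, -1.0])
xp = np.array([2.0, 3.0])
# Here x^T x' = -2, so both expressions equal (1 - 2)^2 = 1 up to round-off.
print(poly_kernel(x, xp), phi(x) @ phi(xp))
```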

An example of a kernel that is not a Mercer kernel is the so-called sigmoid kernel, defined by 

$$\kappa(\mathbf{x}, \mathbf{x}') = \tanh(\gamma \mathbf{x}^T \mathbf{x}' + r) \tag{14.17}$$


(Note that this uses the tanh function even though it is called a sigmoid kernel.) This kernel 
was inspired by the multi-layer perceptron (see Section 16.5), but there is no real reason to use 
it. (For a true “neural net kernel”, which is positive definite, see Section 15.4.5.) 

In general, establishing that a kernel is a Mercer kernel is difficult, and requires techniques 
from functional analysis. However, one can show that it is possible to build up new Mercer 
kernels from simpler ones using a set of standard rules. For example, if κ_1 and κ_2 are both 
Mercer, so is κ(x, x') = κ_1(x, x') + κ_2(x, x'). See e.g., (Schoelkopf and Smola 2002) for details. 


14.2.4 Linear kernels 


Deriving the feature vector implied by a kernel is in general quite difficult, and only possible if 
the kernel is Mercer. However, deriving a kernel from a feature vector is easy: we just use 


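$$\kappa(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T \boldsymbol{\phi}(\mathbf{x}') = \langle \boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}(\mathbf{x}') \rangle \tag{14.18}$$

If φ(x) = x, we get the linear kernel, defined by 

$$\kappa(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{x}' \tag{14.19}$$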


This is useful if the original data is already high dimensional, and if the original features are 
individually informative, e.g., a bag of words representation where the vocabulary size is large, 
or the expression level of many genes. In such a case, the decision boundary is likely to be 
representable as a linear combination of the original features, so it is not necessary to work in 
some other feature space. 

Of course, not all high dimensional problems are linearly separable. For example, images are 
high dimensional, but individual pixels are not very informative, so image classification typically 
requires non-linear kernels (see e.g., Section 14.2.7). 


14.2.5 Matern kernels 


The Matern kernel, which is commonly used in Gaussian process regression (see Section 15.2), 
has the following form 


$$\kappa(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, r}{\ell} \right)^{\nu} K_\nu\left( \frac{\sqrt{2\nu}\, r}{\ell} \right) \tag{14.20}$$

where r = ||x - x'||, ν > 0, ℓ > 0, and K_ν is a modified Bessel function. As ν → ∞, this 
approaches the SE kernel. If ν = 1/2, the kernel simplifies to 

$$\kappa(r) = \exp(-r/\ell) \tag{14.21}$$

If D = 1, and we use this kernel to define a Gaussian process (see Chapter 15), we get the 
Ornstein-Uhlenbeck process, which describes the velocity of a particle undergoing Brownian 
motion (the corresponding function is continuous but not differentiable, and hence is very 
“jagged”). 


14.2.6 String kernels 


The real power of kernels arises when the inputs are structured objects. As an example, we now 
describe one way of comparing two variable length strings using a string kernel. We follow the 
presentation of (Rasmussen and Williams 2006, p100) and (Hastie et al. 2009, p668). 

Consider two strings x and x’ of lengths D, D’, each defined over the alphabet A. 
For example, consider two amino acid sequences, defined over the 20 letter alphabet A = 
{A, R, N, D,C, E,Q,G, H,I, L, K, M, F, P,S,T,W,Y,V}. Let x be the following sequence 
of length 110 


IPTSALVKETLALLSTHRTLLIANETLRIPVPVHKNHQLCTEEIFQGIGTLESQTVQGGTV 
ERLFKNLSLIKKYIDGQKKKCGEERRRVNQFLDYLQEFLGVMNTEWL 


and let x’ be the following sequence of length 153 


PHRRDLCSRSIWLARKIRSDLTALTESYVKHQGLWSELTEAERLQENLQAYRTFHVLLA 
RLLEDQQVHFTPTEGDFHQATHTLLLQVAAFAYQIEELMILLEYKIPRNEADGMLFEKK 
LWGLKVLQELSQWTVRSTHDLRFISSHQTGIP 


These strings have the substring LQE in common. We can define the similarity of two strings 
to be the number of substrings they have in common. 

More formally and more generally, let us say that s is a substring of x if we can write x = usv 
for some (possibly empty) strings u, s and v. Now let φ_s(x) denote the number of times that 
substring s appears in string x. We define the kernel between two strings x and x' as 

$$\kappa(x, x') = \sum_{s \in \mathcal{A}^*} w_s \, \phi_s(x) \, \phi_s(x') \tag{14.22}$$

where w_s ≥ 0 and A* is the set of all strings (of any length) from the alphabet A (this is known 
as the Kleene star operator). This is a Mercer kernel, and can be computed in O(|x| + |x'|) time 
(for certain settings of the weights {w_s}) using suffix trees (Leslie et al. 2003; Vishwanathan and 
Smola 2003; Shawe-Taylor and Cristianini 2004). 

There are various cases of interest. If we set w_s = 0 for |s| > 1 we get a bag-of-characters 
kernel. This defines φ(x) to be the number of times each character in A occurs in x. If we 
require s to be bordered by white-space, we get a bag-of-words kernel, where φ(x) counts how 
many times each possible word occurs. Note that this is a very sparse vector, since most words 
will not be present. If we only consider strings of a fixed length k, we get the k-spectrum 
kernel. This has been used to classify proteins into SCOP superfamilies (Leslie et al. 2003). For 
example if k = 3, we have φ_LQE(x) = 1 and φ_LQE(x') = 2 for the two strings above. 

Figure 14.1 Illustration of a pyramid match kernel computed from two images. Used with kind permission 
of Kristen Grauman. 

Various extensions are possible. For example, we can allow character mismatches (Leslie et al. 
2003). And we can generalize string kernels to compare trees, as described in (Collins and Duffy 
2002). This is useful for classifying (or ranking) parse trees, evolutionary trees, etc. 
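
The k-spectrum special case described above is easy to sketch by direct counting. Note that this simple version takes time proportional to the string lengths times k and is not the suffix tree algorithm from the cited papers; the function names and the toy strings are illustrative assumptions.

```python
from collections import Counter

def spectrum_features(x, k):
    """Count every length-k substring of x, i.e., phi_s(x) for all s with |s| = k."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def spectrum_kernel(x, xp, k):
    """Eq. 14.22 with w_s = 1 when |s| = k and w_s = 0 otherwise."""
    fx, fxp = spectrum_features(x, k), spectrum_features(xp, k)
    # A Counter returns 0 for substrings that never occur in xp.
    return sum(count * fxp[s] for s, count in fx.items())

# "abcabc" contains ab:2, bc:2, ca:1 and "bcab" contains bc:1, ca:1, ab:1.
print(spectrum_kernel("abcabc", "bcab", k=2))   # 2*1 + 2*1 + 1*1 = 5
print(spectrum_kernel("abcabc", "bcab", k=1))   # bag-of-characters kernel: 8
```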


14.2.7 Pyramid match kernels 


In computer vision, it is common to create a bag-of-words representation of an image by 
computing a feature vector (often using SIFT (Lowe 1999)) from a variety of points in the image, 
commonly chosen by an interest point detector. The feature vectors at the chosen places are 
then vector-quantized to create a bag of discrete symbols. 

One way to compare two variable-sized bags of this kind is to use a pyramid match kernel 
(Grauman and Darrell 2007). The basic idea is illustrated in Figure 14.1. Each feature set is 
mapped to a multi-resolution histogram. These are then compared using weighted histogram 
intersection. It turns out that this provides a good approximation to the similarity measure one 
would obtain by performing an optimal bipartite match at the finest spatial resolution, and then 
summing up pairwise similarities between matched points. However, the histogram method is 
faster and is more robust to missing and unequal numbers of points. This is a Mercer kernel. 


14.2.8 Kernels derived from probabilistic generative models 


Suppose we have a probabilistic generative model of feature vectors, p(x|θ). Then there are 
several ways we can use this model to define kernel functions, and thereby make the model 
suitable for discriminative tasks. We sketch two approaches below. 


14.2.8.1 Probability product kernels 


One approach is to define a kernel as follows: 

$$\kappa(\mathbf{x}_i, \mathbf{x}_j) = \int p(\mathbf{x}|\mathbf{x}_i)^{\rho} \, p(\mathbf{x}|\mathbf{x}_j)^{\rho} \, d\mathbf{x} \tag{14.23}$$

where ρ > 0, and p(x|x_i) is often approximated by p(x|θ̂(x_i)), where θ̂(x_i) is a parameter 
estimate computed using a single data vector. This is called a probability product kernel 
(Jebara et al. 2004). 

Although it seems strange to fit a model to a single data point, it is important to bear in 
mind that the fitted model is only being used to see how similar two objects are. In particular, 
if we fit the model to x_i and then the model thinks x_j is likely, this means that x_i and x_j are 
similar. For example, suppose p(x|θ) = N(μ, σ²I), where σ² is fixed. If ρ = 1, and we use 
μ̂(x_i) = x_i and μ̂(x_j) = x_j, we find (Jebara et al. 2004, p825) that 

$$\kappa(\mathbf{x}_i, \mathbf{x}_j) = \frac{1}{(4\pi\sigma^2)^{D/2}} \exp\left( -\frac{1}{4\sigma^2} \|\mathbf{x}_i - \mathbf{x}_j\|^2 \right) \tag{14.24}$$

which is (up to a constant factor) the RBF kernel. 

It turns out that one can compute Equation 14.23 for a variety of generative models, including 
ones with latent variables, such as HMMs. This provides one way to define kernels on variable 
length sequences. Furthermore, this technique works even if the sequences are of real-valued 
vectors, unlike the string kernel in Section 14.2.6. See (Jebara et al. 2004) for further details. 


14.2.8.2 Fisher kernels 


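A more efficient way to use generative models to define kernels is to use a Fisher kernel 
(Jaakkola and Haussler 1998) which is defined as follows: 

$$\kappa(\mathbf{x}, \mathbf{x}') = \mathbf{g}(\mathbf{x})^T \mathbf{F}^{-1} \mathbf{g}(\mathbf{x}') \tag{14.25}$$

where g is the gradient of the log likelihood, or score vector, evaluated at the MLE θ̂ 

$$\mathbf{g}(\mathbf{x}) \triangleq \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}|\boldsymbol{\theta}) \big|_{\hat{\boldsymbol{\theta}}} \tag{14.26}$$

and F is the Fisher information matrix, which is essentially the Hessian: 

$$\mathbf{F} = \nabla\nabla \log p(\mathbf{x}|\boldsymbol{\theta}) \big|_{\hat{\boldsymbol{\theta}}} \tag{14.27}$$

Note that θ̂ is a function of all the data, so the similarity of x and x’ is computed in the context 
of all the data as well. Also, note that we only have to fit one model. 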

The intuition behind the Fisher kernel is the following: let g(x) be the direction (in parameter 
space) in which x would like the parameters to move (from θ̂) so as to maximize its own 
likelihood; call this the directional gradient. Then we say that two vectors x and x’ are similar 
if their directional gradients are similar wrt the geometry encoded by the curvature of the 
likelihood function (see Section 7.5.3). 

Figure 14.2 (a) xor truth table. (b) Fitting a linear logistic regression classifier using degree 10 polynomial 
expansion. (c) Same model, but using an RBF kernel with centroids specified by the 4 black crosses. Figure 
generated by logregXorDemo. 

Interestingly, it was shown in (Saunders et al. 2003) that the string kernel of Section 14.2.6 
is equivalent to the Fisher kernel derived from an L’th order Markov chain (see Section 17.2). 
Also, it was shown in (Elkan 2005) that a kernel defined by the inner product of TF-IDF vectors 
(Section 14.2.2) is approximately equal to the Fisher kernel for a certain generative model of text 
based on the compound Dirichlet multinomial model (Section 3.5.5). 
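
As a small worked example of the Fisher kernel (an illustration under stated assumptions, not taken from the cited papers), suppose p(x|θ) = N(x|μ, σ²I) with σ² fixed, so θ = μ, and take F to be the negative Hessian of the log likelihood so that it is positive definite. Then

```latex
\begin{align}
\mathbf{g}(\mathbf{x}) &= \nabla_{\boldsymbol{\mu}} \log \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \sigma^2 \mathbf{I}) \big|_{\hat{\boldsymbol{\mu}}}
  = \frac{1}{\sigma^2} (\mathbf{x} - \hat{\boldsymbol{\mu}}) \\
\mathbf{F} &= -\nabla_{\boldsymbol{\mu}} \nabla_{\boldsymbol{\mu}} \log \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \sigma^2 \mathbf{I}) \big|_{\hat{\boldsymbol{\mu}}}
  = \frac{1}{\sigma^2} \mathbf{I} \\
\kappa(\mathbf{x}, \mathbf{x}') &= \mathbf{g}(\mathbf{x})^T \mathbf{F}^{-1} \mathbf{g}(\mathbf{x}')
  = \frac{1}{\sigma^2} (\mathbf{x} - \hat{\boldsymbol{\mu}})^T (\mathbf{x}' - \hat{\boldsymbol{\mu}})
\end{align}
```

So in this simple case the Fisher kernel reduces to a linear kernel on data centered at the MLE μ̂ (the sample mean), which makes concrete the remark that the similarity of x and x’ is computed in the context of all the data.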


14.3 Using kernels inside GLMs 


In this section, we discuss one simple way to use kernels for classification and regression. We 
will see other approaches later. 


14.3.1 Kernel machines 


We define a kernel machine to be a GLM where the input feature vector has the form 


$$\boldsymbol{\phi}(\mathbf{x}) = [\kappa(\mathbf{x}, \boldsymbol{\mu}_1), \ldots, \kappa(\mathbf{x}, \boldsymbol{\mu}_K)] \tag{14.28}$$


where μ_k ∈ X are a set of K centroids. If κ is an RBF kernel, this is called an RBF network. 
We discuss ways to choose the μ_k parameters below. We will call Equation 14.28 a kernelized 
feature vector. Note that in this approach, the kernel need not be a Mercer kernel. 

We can use the kernelized feature vector for logistic regression by defining p(y|x, θ) = 
Ber(w^T φ(x)). This provides a simple way to define a non-linear decision boundary. As an 
example, consider the data coming from the exclusive or or xor function. This is a binary-valued 
function of two binary inputs. Its truth table is shown in Figure 14.2(a). In Figure 14.2(b), 
we show some data labeled by the xor function, but we have jittered the points to make 
the picture clearer (see footnote 1). We see we cannot separate the data even using a degree 10 polynomial. 


1. Jittering is a common visualization trick in statistics, wherein points in a plot/display that would otherwise land on 
top of each other are dispersed with uniform additive noise. 

Figure 14.3 RBF basis in 1d. Left column: fitted function. Middle column: basis functions evaluated on 
a grid. Right column: design matrix. Top to bottom we show different bandwidths: τ = 0.1, τ = 0.5, 
τ = 50. Figure generated by linregRbfDemo. 


However, using an RBF kernel and just 4 prototypes easily solves the problem as shown in 
Figure 14.2(c). 
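
The xor example can be reproduced in miniature. The sketch below is not the logregXorDemo script; it assumes one prototype at each corner of the unit square, a bandwidth of 0.5, unjittered training points, and plain gradient ascent without a bias term or regularizer.

```python
import numpy as np

def rbf_features(X, centroids, sigma):
    """Kernelized feature vector of Eq. 14.28 using the isotropic RBF kernel."""
    sq_dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def fit_logreg(Phi, y, lr=0.5, n_iters=2000):
    """Gradient ascent on the Bernoulli log likelihood of logistic regression."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-Phi @ w))     # predicted probability of y = 1
        w += lr * Phi.T @ (y - p)
    return w

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])             # xor truth table
centroids = X.copy()                            # assumption: one prototype per corner
Phi = rbf_features(X, centroids, sigma=0.5)
w = fit_logreg(Phi, y)
pred = (Phi @ w > 0).astype(float)
print("predictions:", pred)
print("weights:", np.round(w, 2))
```

The decision boundary is linear in the 4 kernelized features, even though it is non-linear in the original 2 inputs.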

We can also use the kernelized feature vector inside a linear regression model by defining 
p(y|x, θ) = N(w^T φ(x), σ²). For example, Figure 14.3 shows a 1d data set fit with K = 10 
uniformly spaced RBF prototypes, but with the bandwidth ranging from small to large. Small 
values lead to very wiggly functions, since the predicted function value will only be non-zero for 
points x that are close to one of the prototypes μ_k. If the bandwidth is very large, the design 
matrix reduces to a constant matrix of 1’s, since each point is equally close to every prototype; 
hence the corresponding function is just a straight line. 
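
The effect of the bandwidth on the design matrix can be seen with a short sketch. The inputs, the targets, and the input range below are hypothetical and are not those used by linregRbfDemo; only the three bandwidth values are taken from the caption of Figure 14.3.

```python
import numpy as np

def rbf_design_matrix(x, centroids, tau):
    """N x K design matrix with entries exp(-(x_i - mu_k)^2 / (2 tau^2))."""
    return np.exp(-((x[:, None] - centroids[None, :]) ** 2) / (2.0 * tau**2))

x = np.linspace(0.0, 20.0, 21)                # hypothetical 1d inputs
centroids = np.linspace(0.0, 20.0, 10)        # K = 10 uniformly spaced prototypes
y = np.sin(x / 3.0)                           # hypothetical targets, for illustration only

for tau in [0.1, 0.5, 50.0]:
    Phi = rbf_design_matrix(x, centroids, tau)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)        # least squares estimate of the weights
    rmse = np.sqrt(np.mean((Phi @ w - y) ** 2))
    print(f"tau={tau:5.1f}  smallest entry={Phi.min():.3f}  "
          f"fraction of entries below 1e-3={(Phi < 1e-3).mean():.2f}  train rmse={rmse:.3f}")
```

With the smallest bandwidth almost every entry of the design matrix is numerically zero, while with the largest bandwidth every entry is close to 1, so the columns become nearly indistinguishable.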


14.3.2 L1VMs, RVMs, and other sparse vector machines 


The main issue with kernel machines is: how do we choose the centroids μ_k? If the input is 
low-dimensional Euclidean space, we can uniformly tile the space occupied by the data with 
prototypes, as we did in Figure 14.2(c). However, this approach breaks down in higher numbers 
of dimensions because of the curse of dimensionality. If μ_k ∈ R^D, we can try to perform 
numerical optimization of these parameters (see e.g., (Haykin 1998)), or we can use MCMC 
inference (see e.g., (Andrieu et al. 2001; Kohn et al. 2001)), but the resulting objective function 
/ posterior is highly multimodal. Furthermore, these techniques are hard to extend to structured 
input spaces, where kernels are most useful. 

Another approach is to find clusters in the data and then to assign one prototype per cluster 
center (many clustering algorithms just need a similarity metric as input). However, the regions
of space that have high density are not necessarily the ones where the prototypes are most 
useful for representing the output, that is, clustering is an unsupervised task that may not yield 
a representation that is useful for prediction. Furthermore, there is the need to pick the number 
of clusters. 

A simpler approach is to make each example $\mathbf{x}_i$ be a prototype, so we get


$$\boldsymbol{\phi}(\mathbf{x}) = [\kappa(\mathbf{x}, \mathbf{x}_1), \ldots, \kappa(\mathbf{x}, \mathbf{x}_N)] \tag{14.29}$$


Now we see D = N, so we have as many parameters as data points. However, we can use any 
of the sparsity-promoting priors for w discussed in Chapter 13 to efficiently select a subset of 
the training exemplars. We call this a sparse vector machine. 

The most natural choice is to use $\ell_1$ regularization (Krishnapuram et al. 2005). (Note that in
the multi-class case, it is necessary to use group lasso, since each exemplar is associated with $C$
weights, one per class.) We call this L1VM, which stands for "$\ell_1$-regularized vector machine". By
analogy, we define the use of an $\ell_2$ regularizer to be a L2VM or "$\ell_2$-regularized vector machine";
this of course will not be sparse.

We can get even greater sparsity by using ARD/SBL, resulting in a method called the rele- 
vance vector machine or RVM (Tipping 2001). One can fit this model using generic ARD/SBL 
algorithms, although in practice the most common method is the greedy algorithm in (Tipping 
and Faul 2003) (this is the algorithm implemented in Mike Tipping’s code, which is bundled with 
PMTK). 

Another very popular approach to creating a sparse kernel machine is to use a support 
vector machine or SVM. This will be discussed in detail in Section 14.5. Rather than using a 
sparsity-promoting prior, it essentially modifies the likelihood term, which is rather unnatural 
from a Bayesian point of view. Nevertheless, the effect is similar, as we will see. 

In Figure 14.4, we compare L2VM, L1VM, RVM and an SVM using the same RBF kernel on a
binary classification problem in 2d. For simplicity, $\lambda$ was chosen by hand for L2VM and L1VM;
for RVMs, the parameters are estimated using empirical Bayes; and for the SVM, we use CV to
pick $C = 1/\lambda$, since SVM performance is very sensitive to this parameter (see Section 14.5.3).
We see that all the methods give similar performance. However, RVM is the sparsest (and hence
fastest at test time), then L1VM, and then SVM. RVM is also the fastest to train, since CV for an
SVM is slow. (This is despite the fact that the RVM code is in Matlab and the SVM code is in
C.) This result is fairly typical.

In Figure 14.5, we compare L2VM, L1VM, RVM and an SVM using an RBF kernel on a 1d
regression problem. Again, we see that predictions are quite similar, but RVM is the sparsest,
then L1VM, then SVM (L2VM is not sparse). This is further illustrated in Figure 14.6.


The kernel trick 


Rather than defining our feature vector in terms of kernels, $\boldsymbol{\phi}(\mathbf{x}) = [\kappa(\mathbf{x}, \mathbf{x}_1), \ldots, \kappa(\mathbf{x}, \mathbf{x}_N)]$,
we can instead work with the original feature vectors $\mathbf{x}$, but modify the algorithm so that it
replaces all inner products of the form $\langle \mathbf{x}, \mathbf{x}' \rangle$ with a call to the kernel function, $\kappa(\mathbf{x}, \mathbf{x}')$. This
is called the kernel trick. It turns out that many algorithms can be kernelized in this way. We
give some examples below. Note that we require that the kernel be a Mercer kernel for this trick
to work.
Figure 14.4 Example of non-linear binary classification using an RBF kernel with bandwidth $\sigma = 0.3$. (a)
L2VM with $\lambda = 5$. (b) L1VM with $\lambda = 1$. (c) RVM. (d) SVM with $C = 1/\lambda$ chosen by cross validation.
Black circles denote the support vectors. Figure generated by kernelBinaryClassifDemo.


Kernelized nearest neighbor classification 


Recall that in a 1NN classifier (Section 1.4.2), we just need to compute the Euclidean distance of
a test vector to all the training points, find the closest one, and look up its label. This can be 
kernelized by observing that 


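$$\|\mathbf{x}_i - \mathbf{x}_{i'}\|_2^2 = \langle \mathbf{x}_i, \mathbf{x}_i \rangle + \langle \mathbf{x}_{i'}, \mathbf{x}_{i'} \rangle - 2 \langle \mathbf{x}_i, \mathbf{x}_{i'} \rangle \tag{14.30}$$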


This allows us to apply the nearest neighbor classifier to structured data objects. 
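As a minimal sketch of Equation 14.30 in use, the code below computes squared feature-space distances from kernel calls only and uses them for 1NN classification. The function names and the toy data are illustrative assumptions.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel on equal-length sequences of floats."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def linear_kernel(x, y):
    """Ordinary inner product."""
    return sum(a * b for a, b in zip(x, y))

def kernel_sq_dist(kernel, x, y):
    """Squared distance via Equation 14.30: k(x,x) + k(y,y) - 2 k(x,y)."""
    return kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)

def kernel_1nn(kernel, train_x, train_y, query):
    """Label of the training point closest to `query` in the induced feature space."""
    dists = [kernel_sq_dist(kernel, query, x) for x in train_x]
    return train_y[dists.index(min(dists))]

train_x = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
train_y = ["a", "a", "b"]
pred = kernel_1nn(rbf_kernel, train_x, train_y, (3.5, 3.0))

# With a linear kernel, Equation 14.30 recovers the squared Euclidean distance.
d_lin = kernel_sq_dist(linear_kernel, (1.0, 2.0), (4.0, 6.0))
```

Because only `kernel(x, y)` is ever called, the same code applies to strings, graphs or other structured objects once a suitable kernel is supplied.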


Kernelized K-medoids clustering 


K-means clustering (Section 11.4.2.5) uses Euclidean distance to measure dissimilarity, which is 
not always appropriate for structured objects. We now describe how to develop a kernelized
version of the algorithm.

Figure 14.5 Example of kernel based regression on the noisy sinc function using an RBF kernel with
bandwidth $\sigma = 0.3$. (a) L2VM with $\lambda = 0.5$. (b) L1VM with $\lambda = 0.5$. (c) RVM. (d) SVM regression with
$C = 1/\lambda$ chosen by cross validation, and $\epsilon = 0.1$ (the default for SVMlight). Red circles denote the
retained training exemplars. Figure generated by kernelRegrDemo.

The first step is to replace the K-means algorithm with the K-medoids algorithm. This is
similar to K-means, but instead of representing each cluster’s centroid by the mean of all data 
vectors assigned to this cluster, we make each centroid be one of the data vectors themselves. 
Thus we always deal with integer indexes, rather than data objects. We assign objects to their 
closest centroids as before. When we update the centroids, we look at each object that belongs 
to the cluster, and measure the sum of its distances to all the others in the same cluster; we 
then pick the one which has the smallest such sum: 


$$m_k = \operatorname*{argmin}_{i: z_i = k} \sum_{i': z_{i'} = k} d(i, i') \tag{14.31}$$

where

$$d(i, i') \triangleq \|\mathbf{x}_i - \mathbf{x}_{i'}\|_2^2 \tag{14.32}$$

Figure 14.6 Coefficient vectors of length $N = 100$ for the models in Figure 14.5. Figure generated by
kernelRegrDemo.


This takes $O(n_k^2)$ work per cluster, whereas K-means takes $O(n_k D)$ to update each cluster. The
pseudo-code is given in Algorithm 14.1. This method can be modified to derive a classifier, by
computing the nearest medoid for each class. This is known as nearest medoid classification
(Hastie et al. 2009, p671).

This algorithm can be kernelized by using Equation 14.30 to replace the distance computation,
$d(i, i')$.
Algorithm 14.1: K-medoids algorithm

initialize $m_{1:K}$ as a random subset of size $K$ from $\{1, \ldots, N\}$;
repeat
    $z_i = \operatorname{argmin}_k d(i, m_k)$ for $i = 1 : N$;
    $m_k \leftarrow \operatorname{argmin}_{i: z_i = k} \sum_{i': z_{i'} = k} d(i, i')$ for $k = 1 : K$;
until converged;
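The pseudo-code above can be turned into a short runnable sketch. The version below is one possible reading, not the reference implementation: the function name `k_medoids`, the fixed random seed, the iteration cap, and the handling of empty clusters are assumptions. It touches the data only through a dissimilarity function `dist(i, j)` on integer indexes, so a kernelized distance can be substituted directly.

```python
import random

def k_medoids(dist, n, k, max_iter=100, seed=0):
    """K-medoids on items 0..n-1 given a pairwise dissimilarity dist(i, j).

    Returns (medoids, assignments), where medoids[c] is the index of the
    exemplar representing cluster c.
    """
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)      # random subset of size K
    assign = None
    for _ in range(max_iter):
        # Assign each item to its closest medoid.
        new_assign = [min(range(k), key=lambda c: dist(i, medoids[c]))
                      for i in range(n)]
        if new_assign == assign:           # converged: assignments unchanged
            break
        assign = new_assign
        # Within each cluster, pick the member with the smallest summed distance.
        for c in range(k):
            members = [i for i in range(n) if assign[i] == c]
            if members:                    # keep the old medoid if a cluster is empty
                medoids[c] = min(members,
                                 key=lambda i: sum(dist(i, j) for j in members))
    return medoids, assign

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
medoids, assign = k_medoids(lambda i, j: (points[i] - points[j]) ** 2,
                            len(points), k=2)
```

On this toy data the two well separated groups are recovered, and each medoid is the middle point of its group.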


Kernelized ridge regression 


Applying the kernel trick to distance-based methods was straightforward. It is not so obvious 
how to apply it to parametric models such as ridge regression. However, it can be done, as we 
now explain. This will serve as a good “warm up” for studying SVMs. 


The primal problem 


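Let $\mathbf{x} \in \mathbb{R}^D$ be some feature vector, and $\mathbf{X}$ be the corresponding $N \times D$ design matrix. We
want to minimize


$$J(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w}) + \lambda \|\mathbf{w}\|^2 \tag{14.33}$$

The optimal solution is given by


$$\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I}_D)^{-1} \mathbf{X}^T \mathbf{y} = \Big(\sum_i \mathbf{x}_i \mathbf{x}_i^T + \lambda \mathbf{I}_D\Big)^{-1} \mathbf{X}^T \mathbf{y} \tag{14.34}$$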


The dual problem 


Equation 14.34 is not yet in the form of inner products. However, using the matrix inversion
lemma (Equation 4.107) we rewrite the ridge estimate as follows


$$\mathbf{w} = \mathbf{X}^T (\mathbf{X}\mathbf{X}^T + \lambda \mathbf{I}_N)^{-1} \mathbf{y} \tag{14.35}$$


which takes $O(N^3 + N^2 D)$ time to compute. This can be advantageous if $D$ is large. Furthermore,
we see that we can partially kernelize this, by replacing $\mathbf{X}\mathbf{X}^T$ with the Gram matrix $\mathbf{K}$.
But what about the leading $\mathbf{X}^T$ term?

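Let us define the following dual variables:


$$\boldsymbol{\alpha} \triangleq (\mathbf{K} + \lambda \mathbf{I}_N)^{-1} \mathbf{y} \tag{14.36}$$


Then we can rewrite the primal variables as follows

$$\mathbf{w} = \mathbf{X}^T \boldsymbol{\alpha} = \sum_{i=1}^N \alpha_i \mathbf{x}_i \tag{14.37}$$


This tells us that the solution vector is just a linear sum of the $N$ training vectors. When we
plug this in at test time to compute the predictive mean, we get


$$\hat{f}(\mathbf{x}) = \mathbf{w}^T \mathbf{x} = \sum_{i=1}^N \alpha_i \mathbf{x}_i^T \mathbf{x} = \sum_{i=1}^N \alpha_i \kappa(\mathbf{x}, \mathbf{x}_i) \tag{14.38}$$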
Figure 14.7 Visualization of the first 8 kernel principal component basis functions derived from some 2d
data. We use an RBF kernel with $\sigma^2 = 0.1$. Figure generated by kpcaScholkopf, written by Bernhard
Scholkopf.


So we have successfully kernelized ridge regression by changing from primal to dual variables.
This technique can be applied to many other linear models, such as logistic regression. 
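The primal and dual forms can be compared numerically. The sketch below assumes NumPy is available; the helper names `rbf_gram` and `fit_dual`, the synthetic data, and the value of $\lambda$ are illustrative choices. With a linear kernel the dual solution should reproduce the primal ridge estimate of Equation 14.34, and with an RBF kernel prediction uses only kernel evaluations, as in Equation 14.38.

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def fit_dual(K, y, lam):
    """Dual variables alpha = (K + lam I_N)^{-1} y (Equation 14.36)."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
lam = 0.3

# Linear kernel: w = X^T alpha (Equation 14.37) versus the primal estimate.
alpha = fit_dual(X @ X.T, y, lam)
w_dual = X.T @ alpha
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# RBF kernel: predictions are sums of kernel evaluations (Equation 14.38).
alpha_rbf = fit_dual(rbf_gram(X, X), y, lam)
X_test = rng.normal(size=(5, 3))
y_pred = rbf_gram(X_test, X) @ alpha_rbf
```

Using `np.linalg.solve` rather than forming an explicit inverse is a numerical convenience and does not change the estimate.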


Computational cost 


The cost of computing the dual variables $\boldsymbol{\alpha}$ is $O(N^3)$, whereas the cost of computing the primal
variables $\mathbf{w}$ is $O(D^3)$. Hence the kernel method can be useful in high dimensional settings,
even if we only use a linear kernel (c.f., the SVD trick in Equation 7.44). However, prediction
using the dual variables takes $O(ND)$ time, while prediction using the primal variables only
takes $O(D)$ time. We can speed up prediction by making $\boldsymbol{\alpha}$ sparse, as we discuss in Section 14.5.


Kernel PCA 


In Section 12.2, we saw how we could compute a low-dimensional linear embedding of some 
data using PCA. This required finding the eigenvectors of the sample covariance matrix $\mathbf{S} =
\frac{1}{N} \sum_{i=1}^N \mathbf{x}_i \mathbf{x}_i^T = (1/N)\mathbf{X}^T\mathbf{X}$. However, we can also compute PCA by finding the eigenvectors
of the inner product matrix $\mathbf{X}\mathbf{X}^T$, as we show below. This will allow us to produce a nonlinear
embedding, using the kernel trick, a method known as kernel PCA (Schoelkopf et al. 1998).
First, let $\mathbf{U}$ be an orthogonal matrix containing the eigenvectors of $\mathbf{X}\mathbf{X}^T$ with corresponding
eigenvalues in $\boldsymbol{\Lambda}$. By definition we have $(\mathbf{X}\mathbf{X}^T)\mathbf{U} = \mathbf{U}\boldsymbol{\Lambda}$. Pre-multiplying by $\mathbf{X}^T$ gives


$$(\mathbf{X}^T\mathbf{X})(\mathbf{X}^T\mathbf{U}) = (\mathbf{X}^T\mathbf{U})\boldsymbol{\Lambda} \tag{14.39}$$


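from which we see that the eigenvectors of $\mathbf{X}^T\mathbf{X}$ (and hence of $\mathbf{S}$) are $\mathbf{V} = \mathbf{X}^T\mathbf{U}$, with eigenvalues
given by $\boldsymbol{\Lambda}$ as before. However, these eigenvectors are not normalized, since $\|\mathbf{v}_j\|^2 =
\mathbf{u}_j^T \mathbf{X}\mathbf{X}^T \mathbf{u}_j = \lambda_j \mathbf{u}_j^T \mathbf{u}_j = \lambda_j$. So the normalized eigenvectors are given by $\mathbf{V}_{pca} = \mathbf{X}^T\mathbf{U}\boldsymbol{\Lambda}^{-\frac{1}{2}}$.

This is a useful trick for regular PCA if $D > N$, since $\mathbf{X}^T\mathbf{X}$ has size $D \times D$, whereas $\mathbf{X}\mathbf{X}^T$
has size $N \times N$. It will also allow us to use the kernel trick, as we now show.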

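Now let $\mathbf{K} = \mathbf{X}\mathbf{X}^T$ be the Gram matrix. Recall from Mercer's theorem that the use of a kernel
implies some underlying feature space, so we are implicitly replacing $\mathbf{x}_i$ with $\boldsymbol{\phi}(\mathbf{x}_i) = \boldsymbol{\phi}_i$. Let
$\boldsymbol{\Phi}$ be the corresponding (notional) design matrix, and $\mathbf{S}_\phi = \frac{1}{N} \sum_i \boldsymbol{\phi}_i \boldsymbol{\phi}_i^T$ be the corresponding
(notional) covariance matrix in feature space. The eigenvectors are given by $\mathbf{V}_{kpca} = \boldsymbol{\Phi}^T\mathbf{U}\boldsymbol{\Lambda}^{-\frac{1}{2}}$,
where $\mathbf{U}$ and $\boldsymbol{\Lambda}$ contain the eigenvectors and eigenvalues of $\mathbf{K}$. Of course, we can't actually
compute $\mathbf{V}_{kpca}$, since $\boldsymbol{\phi}_i$ is potentially infinite dimensional. However, we can compute the
projection of a test vector $\mathbf{x}_*$ onto the feature space as follows:


$$\boldsymbol{\phi}_*^T \mathbf{V}_{kpca} = \boldsymbol{\phi}_*^T \boldsymbol{\Phi}^T \mathbf{U}\boldsymbol{\Lambda}^{-\frac{1}{2}} = \mathbf{k}_*^T \mathbf{U}\boldsymbol{\Lambda}^{-\frac{1}{2}} \tag{14.40}$$


where $\mathbf{k}_* = [\kappa(\mathbf{x}_*, \mathbf{x}_1), \ldots, \kappa(\mathbf{x}_*, \mathbf{x}_N)]$.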

There is one final detail to worry about. So far, we have assumed the projected data has 
zero mean, which is not the case in general. We cannot simply subtract off the mean in 
feature space. However, there is a trick we can use. Define the centered feature vector as 
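$\tilde{\boldsymbol{\phi}}_i = \boldsymbol{\phi}(\mathbf{x}_i) - \frac{1}{N} \sum_{j=1}^N \boldsymbol{\phi}(\mathbf{x}_j)$. The Gram matrix of the centered feature vectors is given by


$$\tilde{K}_{ij} = \tilde{\boldsymbol{\phi}}_i^T \tilde{\boldsymbol{\phi}}_j \tag{14.41}$$

$$= \boldsymbol{\phi}_i^T \boldsymbol{\phi}_j - \frac{1}{N} \sum_{k=1}^N \boldsymbol{\phi}_i^T \boldsymbol{\phi}_k - \frac{1}{N} \sum_{k=1}^N \boldsymbol{\phi}_j^T \boldsymbol{\phi}_k + \frac{1}{N^2} \sum_{k=1}^N \sum_{l=1}^N \boldsymbol{\phi}_k^T \boldsymbol{\phi}_l \tag{14.42}$$

$$= \kappa(\mathbf{x}_i, \mathbf{x}_j) - \frac{1}{N} \sum_{k=1}^N \kappa(\mathbf{x}_i, \mathbf{x}_k) - \frac{1}{N} \sum_{k=1}^N \kappa(\mathbf{x}_j, \mathbf{x}_k) + \frac{1}{N^2} \sum_{k=1}^N \sum_{l=1}^N \kappa(\mathbf{x}_k, \mathbf{x}_l) \tag{14.43}$$

This can be expressed in matrix notation as follows:

$$\tilde{\mathbf{K}} = \mathbf{H}\mathbf{K}\mathbf{H} \tag{14.44}$$


where $\mathbf{H} \triangleq \mathbf{I} - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^T$ is the centering matrix. We can convert all this algebra into the
pseudocode shown in Algorithm 14.2.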

Whereas linear PCA is limited to using $L \le D$ components, in kPCA, we can use up to $N$
components, since $\boldsymbol{\Phi}$ has size $N \times D^*$, where $D^*$ is the (potentially infinite) dimensionality
of embedded feature vectors, so its rank can be as large as $N$. Figure 14.7 gives an example of the method applied to some
$D = 2$ dimensional data using an RBF kernel. We project points in the unit grid onto the first
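Algorithm 14.2: Kernel PCA

Input: $\mathbf{K}$ of size $N \times N$, $\mathbf{K}_*$ of size $N_* \times N$, num. latent dimensions $L$;
$\mathbf{O} = \mathbf{1}_N \mathbf{1}_N^T / N$;
$\tilde{\mathbf{K}} = \mathbf{K} - \mathbf{O}\mathbf{K} - \mathbf{K}\mathbf{O} + \mathbf{O}\mathbf{K}\mathbf{O}$;
$[\mathbf{U}, \boldsymbol{\Lambda}] = \mathrm{eig}(\tilde{\mathbf{K}})$;
for $i = 1 : N$ do
    $\mathbf{v}_i = \mathbf{u}_i / \sqrt{\lambda_i}$
$\mathbf{O}_* = \mathbf{1}_{N_*} \mathbf{1}_N^T / N$;
$\tilde{\mathbf{K}}_* = \mathbf{K}_* - \mathbf{O}_*\mathbf{K} - \mathbf{K}_*\mathbf{O} + \mathbf{O}_*\mathbf{K}\mathbf{O}$;
$\mathbf{Z} = \tilde{\mathbf{K}}_* \mathbf{V}(:, 1 : L)$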
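The steps of Algorithm 14.2 can be sketched in NumPy as follows. This is an illustrative reading rather than the kpcaScholkopf code: the function name `kernel_pca` is an assumption, eigenvectors are explicitly sorted by decreasing eigenvalue, and the test kernel is centered using the training Gram matrix so that the matrix dimensions agree. As a sanity check, with a linear kernel the result should match ordinary PCA scores up to the sign of each component.

```python
import numpy as np

def kernel_pca(K, K_star, L):
    """Project test points onto the first L kernel principal components.

    K is the N x N training Gram matrix; K_star is the N_star x N matrix of
    kernel values between test points and training points.
    """
    N = K.shape[0]
    O = np.full((N, N), 1.0 / N)
    K_c = K - O @ K - K @ O + O @ K @ O          # centered Gram matrix, H K H
    evals, U = np.linalg.eigh(K_c)               # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:L]          # indexes of the L largest
    V = U[:, order] / np.sqrt(evals[order])      # v_i = u_i / sqrt(lambda_i)
    O_star = np.full((K_star.shape[0], N), 1.0 / N)
    K_star_c = K_star - O_star @ K - K_star @ O + O_star @ K @ O
    return K_star_c @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
Z = kernel_pca(X @ X.T, X @ X.T, L=2)            # linear kernel, test set = training set

# Reference: ordinary PCA scores from the SVD of the centered data.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z_pca = Xc @ Vt[:2].T
```

Only the leading $L$ eigenvalues are used in the division, which avoids dividing by the near-zero eigenvalues that a centered Gram matrix always has.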



Figure 14.8 2d visualization of some 2d data. (a) PCA projection. (b) Kernel PCA projection. Figure 
generated by kpcaDemo2, based on code by L.J.P. van der Maaten.


8 components and visualize the corresponding surfaces using a contour plot. We see that the 
first two components separate the three clusters, and the following components split the clusters.

Although the features learned by kPCA can be useful for classification (Schoelkopf et al. 1998), 
they are not necessarily so useful for data visualization. For example, Figure 14.8 shows the 
projection of the data from Figure 14.7 onto the first 2 principal bases computed using PCA and 
kPCA. Obviously PCA perfectly represents the data. kPCA represents each cluster by a different 
line. 

Of course, there is no need to project 2d data back into 2d. So let us consider a different 
data set. We will use a 12 dimensional data set representing the three known phases of flow 
in an oil pipeline. (This data, which is widely used to compare data visualization methods, is 
synthetic, and comes from (Bishop and James 1993).) We project this into 2d using PCA and 
kPCA (with an RBF kernel). The results are shown in Figure 14.9. If we perform nearest neighbor 
classification in the low-dimensional space, kPCA makes 13 errors and PCA makes 20 (Lawrence 
Figure 14.9 2d representation of 12 dimensional oil flow data. The different colors/symbols represent the
3 phases of oil flow. (a) PCA. (b) Kernel PCA with Gaussian kernel. Compare to Figure 15.10(b). From Figure
1 of (Lawrence 2005). Used with kind permission of Neil Lawrence.


2005). Nevertheless, the kPCA projection is rather unnatural. In Section 15.5, we will discuss 
how to make kernelized versions of probabilistic PCA. 

Note that there is a close connection between kernel PCA and a technique known as multidimensional
scaling or MDS. This method finds a low-dimensional embedding such that
Euclidean distance in the embedding space approximates the original dissimilarity matrix. See 
e.g., (Williams 2002) for details. 


Support vector machines (SVMs) 


In Section 14.3.2, we saw one way to derive a sparse kernel machine, namely by using a GLM
with kernel basis functions, plus a sparsity-promoting prior such as $\ell_1$ or ARD. An alternative
approach is to change the objective function from negative log likelihood to some other loss
function, as we discussed in Section 6.5.5. In particular, consider the $\ell_2$ regularized empirical
risk function


$$J(\mathbf{w}, \lambda) = \sum_{i=1}^N L(y_i, \hat{y}_i) + \lambda \|\mathbf{w}\|^2 \tag{14.45}$$


where $\hat{y}_i = \mathbf{w}^T \mathbf{x}_i + w_0$. (So far this is in the original feature space; we introduce kernels in a
moment.) If $L$ is quadratic loss, this is equivalent to ridge regression, and if $L$ is the log-loss
defined in Equation 6.73, this is equivalent to logistic regression.

In the ridge regression case, we know that the solution to this has the form $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X} +
\lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$, and plug-in predictions take the form $\hat{w}_0 + \hat{\mathbf{w}}^T\mathbf{x}$. As we saw in Section 14.4.3,
we can rewrite these equations in a way that only involves inner products of the form $\mathbf{x}^T\mathbf{x}'$,
which we can replace by calls to a kernel function, $\kappa(\mathbf{x}, \mathbf{x}')$. This is kernelized, but not sparse.
However, if we replace the quadratic/log-loss with some other loss function, to be explained
below, we can ensure that the solution is sparse, so that predictions only depend on a subset
of the training data, known as support vectors. This combination of the kernel trick plus a
modified loss function is known as a support vector machine or SVM. This technique was
Figure 14.10 (a) Illustration of $\ell_2$, Huber and $\epsilon$-insensitive loss functions, where $\epsilon = 1.5$. Figure generated
by huberLossDemo. (b) Illustration of the $\epsilon$-tube used in SVM regression. Points above the tube have
$\xi_i^+ > 0$ and $\xi_i^- = 0$. Points below the tube have $\xi_i^+ = 0$ and $\xi_i^- > 0$. Points inside the tube have
$\xi_i^+ = \xi_i^- = 0$. Based on Figure 7.7 of (Bishop 2006a).


originally designed for binary classification, but can be extended to regression and multi-class 
classification as we explain below. 

Note that SVMs are very unnatural from a probabilistic point of view. First, they encode 
sparsity in the loss function rather than the prior. Second, they encode kernels by using an 
algorithmic trick, rather than being an explicit part of the model. Finally, SVMs do not result in 
probabilistic outputs, which causes various difficulties, especially in the multi-class classification 
setting (see Section 14.5.2.4 for details). 

It is possible to obtain sparse, probabilistic, multi-class kernel-based classifiers, which work as 
well or better than SVMs, using techniques such as the L1VM or RVM, discussed in Section 14.3.2.
However, we include a discussion of SVMs, despite their non-probabilistic nature, for two main 
reasons. First, they are very popular and widely used, so all students of machine learning should 
know about them. Second, they have some computational advantages over probabilistic methods 
in the structured output case; see Section 19.7. 


SVMs for regression 


The problem with kernelized ridge regression is that the solution vector w depends on all the 
training inputs. We now seek a method to produce a sparse estimate. 

Vapnik (Vapnik et al. 1997) proposed a variant of the Huber loss function (Section 7.4) called 
the epsilon insensitive loss function, defined by 


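$$L_\epsilon(y, \hat{y}) \triangleq \begin{cases} 0 & \text{if } |y - \hat{y}| < \epsilon \\ |y - \hat{y}| - \epsilon & \text{otherwise} \end{cases} \tag{14.46}$$


This means that any point lying inside an $\epsilon$-tube around the prediction is not penalized, as in
Figure 14.10.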
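Equation 14.46 is easy to evaluate directly. The short sketch below compares it with the Huber loss on a few residuals; the function names, the residual values, and the choice $\epsilon = \delta = 1.5$ (matching Figure 14.10) are illustrative assumptions.

```python
def eps_insensitive_loss(y, y_hat, eps):
    """Equation 14.46: zero inside the eps-tube, linear outside it."""
    return max(0.0, abs(y - y_hat) - eps)

def huber_loss(r, delta):
    """Huber loss for residual r: quadratic near zero, linear in the tails."""
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * abs(r) - 0.5 * delta ** 2

residuals = [-3.0, -1.0, 0.0, 1.0, 3.0]
eps_losses = [eps_insensitive_loss(r, 0.0, eps=1.5) for r in residuals]
huber_losses = [huber_loss(r, delta=1.5) for r in residuals]
```

Residuals of magnitude 1.0 incur no $\epsilon$-insensitive loss at all, whereas the Huber loss penalizes every non-zero residual; this flat region is what later produces sparsity.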
The corresponding objective function is usually written in the following form 


$$J = C \sum_{i=1}^N L_\epsilon(y_i, \hat{y}_i) + \frac{1}{2} \|\mathbf{w}\|^2 \tag{14.47}$$


where $\hat{y}_i = f(\mathbf{x}_i) = \mathbf{w}^T\mathbf{x}_i + w_0$ and $C = 1/\lambda$ is a regularization constant. This objective is
convex and unconstrained, but not differentiable, because of the absolute value function in the 
loss term. As in Section 13.4, where we discussed the lasso problem, there are several possible 
algorithms we could use. One popular approach is to formulate the problem as a constrained 
optimization problem. In particular, we introduce slack variables to represent the degree to 
which each point lies outside the tube: 

$$y_i \le f(\mathbf{x}_i) + \epsilon + \xi_i^+ \tag{14.48}$$

$$y_i \ge f(\mathbf{x}_i) - \epsilon - \xi_i^- \tag{14.49}$$


Given this, we can rewrite the objective as follows: 


$$J = C \sum_{i=1}^N (\xi_i^+ + \xi_i^-) + \frac{1}{2} \|\mathbf{w}\|^2 \tag{14.50}$$


This is a quadratic function of $\mathbf{w}$, and must be minimized subject to the linear constraints
in Equations 14.48-14.49, as well as the positivity constraints $\xi_i^+ \ge 0$ and $\xi_i^- \ge 0$. This is a
standard quadratic program in $2N + D + 1$ variables.

One can show (see e.g., (Schoelkopf and Smola 2002)) that the optimal solution has the form 


$$\hat{\mathbf{w}} = \sum_i \alpha_i \mathbf{x}_i \tag{14.51}$$


where $\alpha_i \ge 0$. Furthermore, it turns out that the $\boldsymbol{\alpha}$ vector is sparse, because we don't care
about errors which are smaller than $\epsilon$. The $\mathbf{x}_i$ for which $\alpha_i > 0$ are called the support vectors;
these are points for which the errors lie on or outside the $\epsilon$-tube.

Once the model is trained, we can then make predictions using 


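$$\hat{y}(\mathbf{x}) = \hat{w}_0 + \hat{\mathbf{w}}^T \mathbf{x} \tag{14.52}$$

Plugging in the definition of $\hat{\mathbf{w}}$ we get

$$\hat{y}(\mathbf{x}) = \hat{w}_0 + \sum_i \alpha_i \mathbf{x}_i^T \mathbf{x} \tag{14.53}$$

Finally, we can replace $\mathbf{x}_i^T \mathbf{x}$ with $\kappa(\mathbf{x}_i, \mathbf{x})$ to get a kernelized solution:

$$\hat{y}(\mathbf{x}) = \hat{w}_0 + \sum_i \alpha_i \kappa(\mathbf{x}_i, \mathbf{x}) \tag{14.54}$$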


SVMs for classification 

We now discuss how to apply SVMs to classification. We first focus on the binary case, and 
then discuss the multi-class case in Section 14.5.2.4. 

Hinge loss 

In Section 6.5.5, we showed that the negative log likelihood of a logistic regression model, 


$$L_{nll}(y, \eta) = -\log p(y|\mathbf{x}, \mathbf{w}) = \log(1 + e^{-y\eta}) \tag{14.55}$$

was a convex upper bound on the 0-1 risk of a binary classifier, where $\eta = f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$
is the log odds ratio, and we have assumed the labels are $y \in \{1, -1\}$ rather than $\{0, 1\}$. In
this section, we replace the NLL loss with the hinge loss, defined as


$$L_{hinge}(y, \eta) = \max(0, 1 - y\eta) = (1 - y\eta)_+ \tag{14.56}$$


Here $\eta = f(\mathbf{x})$ is our "confidence" in choosing label $y = 1$; however, it need not have any
probabilistic semantics. See Figure 6.7 for a plot. We see that the function looks like a door
hinge, hence its name. The overall objective has the form


$$\min_{\mathbf{w}, w_0} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^N (1 - y_i f(\mathbf{x}_i))_+ \tag{14.57}$$
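The losses in Equations 14.55 and 14.56 and the objective in Equation 14.57 can be written out directly for a linear discriminant. This is a small sketch with made-up data; the function names and the toy inputs are illustrative assumptions.

```python
import math

def hinge_loss(y, eta):
    """Equation 14.56, with y in {-1, +1}."""
    return max(0.0, 1.0 - y * eta)

def nll_loss(y, eta):
    """Equation 14.55: logistic negative log likelihood, with y in {-1, +1}."""
    return math.log1p(math.exp(-y * eta))

def svm_objective(w, w0, X, y, C):
    """Equation 14.57 for f(x) = w^T x + w0."""
    f = [sum(wj * xj for wj, xj in zip(w, x)) + w0 for x in X]
    penalty = 0.5 * sum(wj * wj for wj in w)
    return penalty + C * sum(hinge_loss(yi, fi) for yi, fi in zip(y, f))

X = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0)]
y = [1, 1, -1]
obj = svm_objective((1.0, 0.0), 0.0, X, y, C=10.0)
```

Only the second point, which is correctly classified but has $y f(\mathbf{x}) = 0.5 < 1$, contributes to the hinge term; the log-loss, by contrast, is strictly positive for every point.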


Once again, this is non-differentiable, because of the max term. However, by introducing slack
variables $\xi_i$, one can show that this is equivalent to solving


$$\min_{\mathbf{w}, w_0, \boldsymbol{\xi}} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^N \xi_i \quad \text{s.t.} \quad \xi_i \ge 0, \; y_i(\mathbf{x}_i^T \mathbf{w} + w_0) \ge 1 - \xi_i, \; i = 1 : N \tag{14.58}$$


This is a quadratic program in $N + D + 1$ variables, subject to $O(N)$ constraints. We
can eliminate the primal variables $\mathbf{w}$, $w_0$ and $\xi_i$, and just solve the $N$ dual variables, which
correspond to the Lagrange multipliers for the constraints. Standard solvers take $O(N^3)$ time.
However, specialized algorithms, which avoid the use of generic QP solvers, have been developed
for this problem, such as the sequential minimal optimization or SMO algorithm (Platt 1998).
In practice this can take $O(N^2)$. However, even this can be too slow if $N$ is large. In such
settings, it is common to use linear SVMs, which take $O(N)$ time to train (Joachims 2006; Bottou
et al. 2007).

One can show that the solution has the form 


$$\hat{\mathbf{w}} = \sum_i \alpha_i \mathbf{x}_i \tag{14.59}$$


where $\alpha_i = \lambda_i y_i$ and where $\boldsymbol{\alpha}$ is sparse (because of the hinge loss). The $\mathbf{x}_i$ for which $\alpha_i > 0$ are
called support vectors; these are points which are either incorrectly classified, or are classified
correctly but are on or inside the margin (we discuss margins below). See Figure 14.12(b) for an
illustration.

At test time, prediction is done using 


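$$\hat{y}(\mathbf{x}) = \mathrm{sgn}(f(\mathbf{x})) = \mathrm{sgn}(\hat{w}_0 + \hat{\mathbf{w}}^T \mathbf{x}) \tag{14.60}$$

Using Equation 14.59 and the kernel trick we have

$$\hat{y}(\mathbf{x}) = \mathrm{sgn}\Big(\hat{w}_0 + \sum_{i=1}^N \alpha_i \kappa(\mathbf{x}_i, \mathbf{x})\Big) \tag{14.61}$$


This takes $O(sD)$ time to compute, where $s \le N$ is the number of support vectors. This
depends on the sparsity level, and hence on the regularizer $C$.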
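Equation 14.61 only needs the support vectors and their coefficients at test time. The sketch below uses a hypothetical, hand-specified "trained" model (the support vectors, coefficients and offset are invented for illustration, not the output of any solver) to show the prediction step.

```python
import math

def rbf(x, z, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length sequences."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2.0 * sigma ** 2))

def svm_predict(x, support_vectors, alphas, w0, kernel=rbf):
    """Equation 14.61: sign of w0 + sum_i alpha_i kappa(x_i, x), over support vectors only."""
    f = w0 + sum(a * kernel(sv, x) for sv, a in zip(support_vectors, alphas))
    return 1 if f >= 0 else -1

# Hypothetical model: each coefficient already includes the label sign y_i.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
alphas = [-1.0, 1.0]
queries = [(0.1, 0.0), (1.9, 2.1)]
labels = [svm_predict(q, support_vectors, alphas, w0=0.0) for q in queries]
```

The loop runs over the $s$ stored support vectors rather than all $N$ training points, which is the source of the $O(sD)$ prediction cost.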
Figure 14.11 Illustration of the large margin principle. Left: a separating hyper-plane with large margin.
Right: a separating hyper-plane with small margin.




Figure 14.12 (a) Illustration of the geometry of a linear decision boundary in 2d. A point x is classified 
as belonging in decision region R1 if f(x) > 0, otherwise it belongs in decision region R2; here f(x) 
is known as a discriminant function. The decision boundary is the set of points such that f(x) = 0. 
$\mathbf{w}$ is a vector which is perpendicular to the decision boundary. The term $w_0$ controls the distance of
the decision boundary from the origin. The signed distance of $\mathbf{x}$ from its orthogonal projection onto the
decision boundary, $\mathbf{x}_\perp$, is given by $f(\mathbf{x})/\|\mathbf{w}\|$. Based on Figure 4.1 of (Bishop 2006a). (b) Illustration of
the soft margin principle. Points with circles around them are support vectors. We also indicate the value 
of the corresponding slack variables. Based on Figure 7.3 of (Bishop 2006a). 
The large margin principle


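In this section, we derive Equation 14.58 from a completely different perspective. Recall that our
goal is to derive a discriminant function $f(\mathbf{x})$ which will be linear in the feature space implied
by the choice of kernel. Consider a point $\mathbf{x}$ in this induced space. Referring to Figure 14.12(a),
we see that

$$\mathbf{x} = \mathbf{x}_\perp + r \frac{\mathbf{w}}{\|\mathbf{w}\|} \tag{14.62}$$

where $r$ is the distance of $\mathbf{x}$ from the decision boundary whose normal vector is $\mathbf{w}$, and $\mathbf{x}_\perp$ is
the orthogonal projection of $\mathbf{x}$ onto this boundary. Hence


$$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0 = (\mathbf{w}^T \mathbf{x}_\perp + w_0) + r \frac{\mathbf{w}^T \mathbf{w}}{\|\mathbf{w}\|} \tag{14.63}$$


Now $f(\mathbf{x}_\perp) = 0$ so $0 = \mathbf{w}^T \mathbf{x}_\perp + w_0$. Hence $f(\mathbf{x}) = r \frac{\mathbf{w}^T \mathbf{w}}{\sqrt{\mathbf{w}^T \mathbf{w}}}$, and $r = \frac{f(\mathbf{x})}{\|\mathbf{w}\|}$.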


We would like to make this distance $r = f(\mathbf{x})/\|\mathbf{w}\|$ as large as possible, for reasons illustrated
in Figure 14.11. In particular, there might be many lines that perfectly separate the training data
(especially if we work in a high dimensional feature space), but intuitively, the best one to pick
is the one that maximizes the margin, i.e., the perpendicular distance to the closest point. In
addition, we want to ensure each point is on the correct side of the boundary, hence we want
$f(\mathbf{x}_i) y_i > 0$. So our objective becomes


$$\max_{\mathbf{w}, w_0} \min_{i=1}^{N} \frac{y_i(\mathbf{w}^T \mathbf{x}_i + w_0)}{\|\mathbf{w}\|} \tag{14.64}$$

Note that by rescaling the parameters using $\mathbf{w} \rightarrow k\mathbf{w}$ and $w_0 \rightarrow k w_0$, we do not change the
distance of any point to the boundary, since the $k$ factor cancels out when we divide by $\|\mathbf{w}\|$.
Therefore let us define the scale factor such that $y_i f_i = 1$ for the point that is closest to the
decision boundary. We therefore want to optimize


$$\min_{\mathbf{w}, w_0} \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^T \mathbf{x}_i + w_0) \ge 1, \; i = 1 : N \tag{14.65}$$

(The factor of $\frac{1}{2}$ is added for convenience and doesn't affect the optimal parameters.) The
constraint says that we want all points to be on the correct side of the decision boundary with 
a margin of at least 1. For this reason, we say that an SVM is an example of a large margin 
classifier. 

If the data is not linearly separable (even after using the kernel trick), there will be no feasible
solution in which $y_i f_i \ge 1$ for all $i$. We therefore introduce slack variables $\xi_i \ge 0$ such that
$\xi_i = 0$ if the point is on or inside the correct margin boundary, and $\xi_i = |y_i - f_i|$ otherwise. If
$0 < \xi_i \le 1$ the point lies inside the margin, but on the correct side of the decision boundary.
If $\xi_i > 1$, the point lies on the wrong side of the decision boundary. See Figure 14.12(b).
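The three cases for the slack variable can be made concrete with a few lines of code. The sketch below relies on the fact that, for labels in $\{-1, +1\}$ and a violated margin, $|y_i - f_i| = 1 - y_i f_i$; the function names and return strings are illustrative assumptions.

```python
def slack(y, f):
    """Slack xi = max(0, 1 - y f); equals |y - f| whenever the margin is violated."""
    return max(0.0, 1.0 - y * f)

def margin_status(y, f):
    """Classify a point by the value of its slack variable."""
    xi = slack(y, f)
    if xi == 0.0:
        return "on or beyond the correct margin boundary"
    if xi <= 1.0:
        return "inside the margin, correct side"
    return "wrong side of the decision boundary"

examples = [(1, 2.0), (1, 0.25), (-1, 0.5)]
slacks = [slack(y, f) for y, f in examples]
statuses = [margin_status(y, f) for y, f in examples]
```

For the last example, $y = -1$ and $f = 0.5$ give $\xi = 1.5 = |y - f|$, so the point is misclassified.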

We replace the hard constraints that $y_i f_i \ge 1$ with the soft margin constraints that $y_i f_i \ge
1 - \xi_i$. The new objective becomes


$$\min_{\mathbf{w}, w_0, \boldsymbol{\xi}} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^N \xi_i \quad \text{s.t.} \quad \xi_i \ge 0, \; y_i(\mathbf{x}_i^T \mathbf{w} + w_0) \ge 1 - \xi_i \tag{14.66}$$

Figure 14.13 Log-odds vs x for 3 different methods. Based on Figure 10 of (Tipping 2001). Used with kind
permission of Mike Tipping.


which is the same as Equation 14.58. Since $\xi_i > 1$ means point $i$ is misclassified, we can
interpret $\sum_i \xi_i$ as an upper bound on the number of misclassified points.

The parameter $C$ is a regularization parameter that controls the number of errors we are
willing to tolerate on the training set. It is common to define this using $C = 1/(\nu N)$, where
$0 < \nu \le 1$ controls the fraction of misclassified points that we allow during the training phase.
This is called a $\nu$-SVM classifier. This is usually set using cross-validation (see Section 14.5.3).


Probabilistic output 


An SVM classifier produces a hard-labeling, $\hat{y}(\mathbf{x}) = \mathrm{sign}(f(\mathbf{x}))$. However, we often want a
measure of confidence in our prediction. One heuristic approach is to interpret $f(\mathbf{x})$ as the
log-odds ratio, $\log \frac{p(y=1|\mathbf{x})}{p(y=0|\mathbf{x})}$. We can then convert the output of an SVM to a probability using

$$p(y = 1|\mathbf{x}, \boldsymbol{\theta}) = \sigma(a f(\mathbf{x}) + b) \tag{14.67}$$


where $a$, $b$ can be estimated by maximum likelihood on a separate validation set. (Using the
training set to estimate $a$ and $b$ leads to severe overfitting.) This technique was first proposed in
(Platt 2000).
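Fitting $a$ and $b$ in Equation 14.67 is a one-dimensional logistic regression on the SVM scores. The following is a simplified sketch that uses plain gradient ascent on the validation log likelihood; it is not a reproduction of Platt's procedure, and the function name `fit_platt`, the step size, the iteration count, and the validation scores are all made-up assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit p(y=1|x) = sigmoid(a f(x) + b) by gradient ascent on the log likelihood.

    scores: SVM outputs f(x) on a held-out validation set; labels are in {0, 1}.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        errs = [y - sigmoid(a * f + b) for f, y in zip(scores, labels)]
        a += lr * sum(e * f for e, f in zip(errs, scores)) / n
        b += lr * sum(errs) / n
    return a, b

# Hypothetical validation scores; the classes overlap, so the optimum is finite.
val_scores = [-2.0, -1.0, -0.5, 0.3, 0.4, 1.0, 1.5, 2.5]
val_labels = [0, 0, 1, 0, 1, 1, 1, 1]
a, b = fit_platt(val_scores, val_labels)
p = sigmoid(a * 0.8 + b)   # calibrated probability for a new score f(x) = 0.8
```

If the validation scores were perfectly separable, the likelihood would keep increasing as $a$ grows, so in practice some form of regularization or early stopping would be needed.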

However, the resulting probabilities are not particularly well calibrated, since there is nothing 
in the SVM training procedure that justifies interpreting f(x) as a log-odds ratio. To illustrate 
this, consider an example from (Tipping 2001). Suppose we have 1d data where $p(x|y = 0) =
\mathrm{Unif}(0, 1)$ and $p(x|y = 1) = \mathrm{Unif}(0.5, 1.5)$. Since the class-conditional distributions overlap in
the middle, the log-odds of class 1 over class 0 should be zero in [0.5, 1.0], and infinite outside
this region. We sampled 1000 points from the model, and then fit an RVM and an SVM with
a Gaussian kernel of width 0.1. Both models can perfectly capture the decision boundary, and
achieve a generalization error of 25%, which is Bayes optimal in this problem. The probabilistic
output from the RVM is a good approximation to the true log-odds, but this is not the case for 
the SVM, as shown in Figure 14.13. 
Figure 14.14 (a) The one-versus-rest approach. The green region is predicted to be both class 1 and class
2. (b) The one-versus-one approach. The label of the green region is ambiguous. Based on Figure 4.2 of
(Bishop 2006a).


SVMs for multi-class classification 


In Section 8.3.7, we saw how we could “upgrade” a binary logistic regression model to the multi- 
class case, by replacing the sigmoid function with the softmax, and the Bernoulli distribution 
with the multinomial. Upgrading an SVM to the multi-class case is not so easy, since the outputs 
are not on a calibrated scale and hence are hard to compare to each other. 

The obvious approach is to use a one-versus-the-rest approach (also called one-vs-all), in
which we train $C$ binary classifiers, $f_c(\mathbf{x})$, where the data from class $c$ is treated as positive,
and the data from all the other classes is treated as negative. However, this can result in regions
of input space which are ambiguously labeled, as shown in Figure 14.14(a).

A common alternative is to pick $\hat{y}(\mathbf{x}) = \arg\max_c f_c(\mathbf{x})$. However, this technique may
not work either, since there is no guarantee that the different $f_c$ functions have comparable
magnitudes. In addition, each binary subproblem is likely to suffer from the class imbalance
problem. To see this, suppose we have 10 equally represented classes. When training $f_1$, we
will have 10% positive examples and 90% negative examples, which can hurt performance. It is
possible to devise ways to train all $C$ classifiers simultaneously (Weston and Watkins 1999), but
the resulting method takes $O(C^2 N^2)$ time, instead of the usual $O(C N^2)$ time.

Another approach is to use the one-versus-one or OVO approach, also called all pairs, in
which we train $C(C-1)/2$ classifiers to discriminate all pairs $f_{c,c'}$. We then classify a point into
the class which has the highest number of votes. However, this can also result in ambiguities,
as shown in Figure 14.14(b). Also, it takes $O(C^2 N^2)$ time to train and $O(C^2 N_{sv})$ to test each
data point, where $N_{sv}$ is the number of support vectors.² See also (Allwein et al. 2000) for an
approach based on error-correcting output codes.
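The two decision rules can be sketched in a few lines, given the outputs of already-trained binary classifiers. The scores below are invented for illustration, and the convention that a positive pairwise score votes for the first class of the pair is an assumption of this sketch.

```python
from collections import Counter
from itertools import combinations

def ovr_predict(scores):
    """One-versus-rest: pick argmax_c f_c(x) from a dict {class: score}."""
    return max(scores, key=scores.get)

def ovo_predict(pair_scores, classes):
    """One-versus-one: pair_scores[(c, d)] > 0 is a vote for c, otherwise for d."""
    votes = Counter()
    for c, d in combinations(classes, 2):      # C(C-1)/2 pairwise classifiers
        votes[c if pair_scores[(c, d)] > 0 else d] += 1
    return votes.most_common(1)[0][0], dict(votes)

classes = ["a", "b", "c"]
label_ovr = ovr_predict({"a": -0.2, "b": 0.7, "c": 0.1})
label_ovo, votes = ovo_predict(
    {("a", "b"): -1.3, ("a", "c"): 0.4, ("b", "c"): 2.0}, classes)

# A cyclic outcome gives every class exactly one vote, so the label is ambiguous.
_, tied = ovo_predict(
    {("a", "b"): 1.0, ("a", "c"): -1.0, ("b", "c"): 1.0}, classes)
```

In the tied case the winner returned by `most_common` is arbitrary, which is the kind of ambiguity illustrated in Figure 14.14(b).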

It is worth remembering that all of these difficulties, and the plethora of heuristics that have 
been proposed to fix them, fundamentally arise because SVMs do not model uncertainty using 
probabilities, so their output scores are not comparable across classes. 


2. We can reduce the test time by structuring the classes into a DAG (directed acyclic graph), and performing $O(C)$
pairwise comparisons (Platt et al. 2000). However, the $O(C^2)$ factor in the training time is unavoidable.
Figure 14.15 (a) A cross validation estimate of the 0-1 error for an SVM classifier with RBF kernel with
different precisions $\gamma = 1/(2\sigma^2)$ and different regularizer $\lambda = 1/C$, applied to a synthetic data set drawn
from a mixture of 2 Gaussians. (b) A slice through this surface for $\gamma = 5$. The red dotted line is the Bayes
optimal error, computed using Bayes rule applied to the model used to generate the data. Based on Figure
12.6 of (Hastie et al. 2009). Figure generated by svmCgammaDemo.


Choosing C 


SVMs for both classification and regression require that you specify the kernel function and the 
parameter C. Typically C is chosen by cross-validation. Note, however, that C interacts quite 
strongly with the kernel parameters. For example, suppose we are using an RBF kernel with 
precision $\gamma = 1/(2\sigma^2)$. If $\gamma = 5$, corresponding to narrow kernels, we need heavy regularization, 
and hence small C (so $\lambda = 1/C$ is big). If $\gamma = 1$, a larger value of C should be used. So we 
see that $\gamma$ and C are tightly coupled. This is illustrated in Figure 14.15, which shows the CV 
estimate of the 0-1 risk as a function of C and $\gamma$. 

The authors of libsvm recommend (Hsu et al. 2009) using CV over a 2d grid with values $C \in \{2^{-5}, 2^{-3}, \ldots, 2^{15}\}$ 
and $\gamma \in \{2^{-15}, 2^{-13}, \ldots, 2^{3}\}$. In addition, it is important to standardize 
the data first, for a spherical Gaussian kernel to make sense. 
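
As a minimal sketch of the grid just described, the following Python code enumerates the (C, γ) pairs and standardizes a single feature column. The function names `libsvm_style_grid` and `standardize` are our own, chosen for illustration; they are not part of libsvm. The cross-validation loop itself is omitted, since it depends on the SVM implementation being used.

```python
import itertools

def libsvm_style_grid():
    """Return every (C, gamma) pair of the coarse grid quoted in the text."""
    C_values = [2.0 ** k for k in range(-5, 16, 2)]      # 2^-5, 2^-3, ..., 2^15
    gamma_values = [2.0 ** k for k in range(-15, 4, 2)]  # 2^-15, 2^-13, ..., 2^3
    return list(itertools.product(C_values, gamma_values))

def standardize(column):
    """Shift and scale one feature column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n
    std = var ** 0.5 or 1.0  # fall back to 1 for a constant feature
    return [(v - mean) / std for v in column]

grid = libsvm_style_grid()
```

Each pair in `grid` would be scored by cross-validation, and the pair with the lowest estimated risk retained.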

To choose C efficiently, one can develop a path following algorithm in the spirit of LARS 
(Section 13.3.4). The basic idea is to start with $\lambda$ large, so that the margin $1/||\mathbf{w}(\lambda)||$ is wide, 
and hence all points are inside of it and have $\alpha_i = 1$. By slowly decreasing $\lambda$, a small set of 
points will move from inside the margin to outside, and their $\alpha_i$ values will change from 1 to 0, 
as they cease to be support vectors. When $\lambda$ is maximal, the function is completely smoothed, 
and no support vectors remain. See (Hastie et al. 2004) for the details. 


Summary of key points 


Summarizing the above discussion, we recognize that SVM classifiers involve three key ingre- 
dients: the kernel trick, sparsity, and the large margin principle. The kernel trick is necessary 
to prevent underfitting, i.e., to ensure that the feature vector is sufficiently rich that a linear 
classifier can separate the data. (Recall from Section 14.2.3 that any Mercer kernel can be viewed 
as implicitly defining a potentially high dimensional feature vector.) If the original features are 
already high dimensional (as in many gene expression and text classification problems), it suffices 
to use a linear kernel, $\kappa(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{x}'$, which is equivalent to working with the original 
features. 
Method  Opt. w      Opt. kernel  Sparse  Prob.  Multiclass  Non-Mercer  Section 
L2VM    Convex      EB           No      Yes    Yes         Yes         14.3.2 
L1VM    Convex      CV           Yes     Yes    Yes         Yes         14.3.2 
RVM     Not convex  EB           Yes     Yes    Yes         Yes         14.3.2 
SVM     Convex      CV           Yes     No     Indirectly  No          14.5 
GP      N/A         EB           No      Yes    Yes         No          15 


Table 14.1 Comparison of various kernel based classifiers. EB = empirical Bayes, CV = cross validation. 
See text for details. 


The sparsity and large margin principles are necessary to prevent overfitting, i.e., to ensure 
that we do not use all the basis functions. These two ideas are closely related to each other, 
and both arise (in this case) from the use of the hinge loss function. However, there are other 
methods of achieving sparsity (such as $\ell_1$), and also other methods of maximizing the margin 
(such as boosting). A deeper discussion of this point takes us outside of the scope of this book. 
See e.g., (Hastie et al. 2009) for more information. 


A probabilistic interpretation of SVMs 


In Section 14.3, we saw how to use kernels inside GLMs to derive probabilistic classifiers, such as 
the L1VM and RVM. And in Section 15.3, we will discuss Gaussian process classifiers, which also 
use kernels. However, all of these approaches use a logistic or probit likelihood, as opposed to 
the hinge loss used by SVMs. It is natural to wonder if one can interpret the SVM more directly 
as a probabilistic model. To do so, we must interpret $C g(m)$ as a negative log likelihood, where 
$g(m) = (1 - m)_+$, where $m = y f(\mathbf{x})$ is the margin. Hence $p(y = 1|f) = \exp(-C g(f))$ 
and $p(y = -1|f) = \exp(-C g(-f))$. By summing over both values of y, we require that 
$\exp(-C g(f)) + \exp(-C g(-f))$ be a constant independent of f. But it turns out this is not 
possible for any C > 0 (Sollich 2002). 

However, if we are willing to relax the sum-to-one condition, and work with a pseudo- 
likelihood, we can derive a probabilistic interpretation of the hinge loss (Polson and Scott 2011). 
In particular, one can show that 


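$$\exp\left(-2(1 - y_i \mathbf{x}_i^T \mathbf{w})_+\right) = \int_0^\infty \frac{1}{\sqrt{2\pi\lambda_i}} \exp\left(-\frac{1}{2}\frac{(1 + \lambda_i - y_i \mathbf{x}_i^T \mathbf{w})^2}{\lambda_i}\right) d\lambda_i \qquad (14.68)$$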


Thus the exponential of the negative hinge loss can be represented as a Gaussian scale mixture. 
This allows one to fit an SVM using EM or Gibbs sampling, where $\lambda_i$ are the latent variables. This 
in turn opens the door to Bayesian methods for setting the hyper-parameters for the prior on 
w. See (Polson and Scott 2011) for details. (See also (Franc et al. 2011) for a different probabilistic 
interpretation of SVMs.) 
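
The scale mixture representation of the hinge loss (Equation 14.68) can be checked numerically. The sketch below is only an illustration, not the method of (Polson and Scott 2011): it writes the margin as m = y_i x_i^T w, substitutes λ = t² to remove the 1/√λ singularity at the origin, and applies the midpoint rule. The function names, the truncation point `t_max` and the number of steps are our own assumptions.

```python
import math

def hinge_scale_mixture(m, t_max=12.0, steps=20000):
    """Midpoint-rule estimate of the integral over lambda in Equation 14.68.

    m is the margin y_i * x_i^T w. With lambda = t^2 we have
    d(lambda) = 2 t dt, so the integrand stays bounded near zero.
    """
    u = 1.0 - m
    h = t_max / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        lam = t * t
        density = math.exp(-(u + lam) ** 2 / (2.0 * lam)) / math.sqrt(2.0 * math.pi * lam)
        total += density * 2.0 * t
    return total * h

def exp_neg_twice_hinge(m):
    """Closed form exp(-2 (1 - m)_+) that the integral should reproduce."""
    return math.exp(-2.0 * max(1.0 - m, 0.0))
```

For margins on either side of 1, the two functions should agree up to the quadrature error.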


Comparison of discriminative kernel methods 


We have mentioned several different methods for classification and regression based on kernels, 
which we summarize in Table 14.1. (GP stands for “Gaussian process”, which we discuss in 
Chapter 15.) The columns have the following meaning: 
• Optimize w: a key question is whether the objective $J(\mathbf{w}) = -\log p(\mathcal{D}|\mathbf{w}) - \log p(\mathbf{w})$ 
is convex or not. L2VM, L1VM and SVMs have convex objectives. RVMs do not. GPs are 
Bayesian methods that do not perform parameter estimation. 


• Optimize kernel: all the methods require that one “tune” the kernel parameters, such as the 
bandwidth of the RBF kernel, as well as the level of regularization. For methods based on 
Gaussians, including L2VM, RVMs and GPs, we can use efficient gradient based optimizers to 
maximize the marginal likelihood. For SVMs and L1VM, we must use cross validation, which 
is slower (see Section 14.5.3). 


• Sparse: L1VM, RVMs and SVMs are sparse kernel methods, in that they only use a subset of 
the training examples. GPs and L2VM are not sparse: they use all the training examples. The 
principal advantage of sparsity is that prediction at test time is usually faster. In addition, 
one can sometimes get improved accuracy. 


• Probabilistic: All the methods except for SVMs produce probabilistic output of the form 
p(y|x). SVMs produce a “confidence” value that can be converted to a probability, but such 
probabilities are usually very poorly calibrated (see Section 14.5.2.3). 


• Multiclass: All the methods except for SVMs naturally work in the multiclass setting, by using 
a multinoulli output instead of Bernoulli. The SVM can be made into a multiclass classifier, 
but there are various difficulties with this approach, as discussed in Section 14.5.2.4. 


• Mercer kernel: SVMs and GPs require that the kernel is positive definite; the other techniques 
do not. 


Apart from these differences, there is the natural question: which method works best? In 
a small experiment³, we found that all of these methods had similar accuracy when averaged 
over a range of problems, provided they have the same kernel, and provided the regularization 
constants are chosen appropriately. 

Given that the statistical performance is roughly the same, what about the computational 
performance? GPs and L2VM are generally the slowest, taking $O(N^3)$ time, since they don’t 
exploit sparsity (although various speedups are possible, see Section 15.6). SVMs also take 
$O(N^3)$ time to train (unless we use a linear kernel, in which case we only need O(N) time 
(Joachims 2006)). However, the need to use cross validation can make SVMs slower than RVMs. 
L1VM should be faster than an RVM, since an RVM requires multiple rounds of $\ell_1$ minimization 
(see Section 13.7.4.3). However, in practice it is common to use a greedy method to train RVMs, 
which is faster than $\ell_1$ minimization. This is reflected in our empirical results. 

The conclusion of all this is as follows: if speed matters, use an RVM, but if well-calibrated 
probabilistic output matters (e.g., for active learning or control problems), use a GP. The only 
circumstance under which using an SVM seems sensible is the structured output case, where 
likelihood-based methods can be slow. (We attribute the enormous popularity of SVMs not 
to their superiority, but to ignorance of the alternatives, and also to the lack of high quality 
software implementing the alternatives.) 

Section 16.7.1 gives a more extensive experimental comparison of supervised learning methods, 
including SVMs and various non kernel methods. 


3. See http://pmtk3.googlecode.com/svn/trunk/docs/tutorial/html/tutKernelClassif.html. 
Figure 14.16 A comparison of some popular smoothing kernels. The boxcar kernel has compact support 
but is not smooth. The Epanechnikov kernel has compact support but is not differentiable at its boundary. 
The tri-cube has compact support and two continuous derivatives at the boundary of its support. The 
Gaussian is differentiable, but does not have compact support. Based on Figure 6.2 of (Hastie et al. 2009). 
Figure generated by smoothingKernelPlot. 


Kernels for building generative models 


There is a different kind of kernel known as a smoothing kernel which can be used to create 
non-parametric density estimates. This can be used for unsupervised density estimation, p(x), 
as well as for creating generative models for classification and regression by making models of 
the form p(y, x). 


Smoothing kernels 


A smoothing kernel is a function of one argument which satisfies the following properties: 


$$\int \kappa(x)\,dx = 1, \quad \int x\,\kappa(x)\,dx = 0, \quad \int x^2 \kappa(x)\,dx > 0 \qquad (14.69)$$


A simple example is the Gaussian kernel, 


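$$\kappa(x) \triangleq \frac{1}{(2\pi)^{1/2}} e^{-x^2/2} \qquad (14.70)$$

We can control the width of the kernel by introducing a bandwidth parameter h: 

$$\kappa_h(x) \triangleq \frac{1}{h}\kappa\left(\frac{x}{h}\right) \qquad (14.71)$$

We can generalize to vector valued inputs by defining an RBF kernel: 

$$\kappa_h(\mathbf{x}) = \kappa_h(||\mathbf{x}||) \qquad (14.72)$$

In the case of the Gaussian kernel, this becomes 

$$\kappa_h(\mathbf{x}) = \frac{1}{h^D (2\pi)^{D/2}} \prod_{j=1}^D \exp\left(-\frac{1}{2h^2} x_j^2\right) \qquad (14.73)$$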
Although Gaussian kernels are popular, they have unbounded support. An alternative kernel, 
with compact support, is the Epanechnikov kernel, defined by 

$$\kappa(x) \triangleq \frac{3}{4}(1 - x^2)\,\mathbb{I}(|x| \le 1) \qquad (14.74)$$

This is plotted in Figure 14.16. Compact support can be useful for efficiency reasons, since one 
can use fast nearest neighbor methods to evaluate the density. 

Unfortunately, the Epanechnikov kernel is not differentiable at the boundary of its support. 
An alternative is the tri-cube kernel, defined as follows: 

$$\kappa(x) \triangleq \frac{70}{81}(1 - |x|^3)^3\,\mathbb{I}(|x| \le 1) \qquad (14.75)$$

This has compact support and has two continuous derivatives at the boundary of its support. 
See Figure 14.16. 
The boxcar kernel is simply the uniform distribution: 


$$\kappa(x) \triangleq \mathbb{I}(|x| \le 1) \qquad (14.76)$$


We will use this kernel below. 
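
The kernels above are simple to write down in code. The following sketch, in plain Python, defines the Gaussian, Epanechnikov and tri-cube kernels, the bandwidth rescaling of Equation 14.71, and a crude midpoint-rule integrator that can be used to check the three properties in Equation 14.69. The helper names `scaled` and `moment`, and the integration limits, are our own choices for illustration.

```python
import math

def gaussian(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def epanechnikov(x):
    return 0.75 * (1.0 - x * x) if abs(x) <= 1.0 else 0.0

def tricube(x):
    return (70.0 / 81.0) * (1.0 - abs(x) ** 3) ** 3 if abs(x) <= 1.0 else 0.0

def scaled(kernel, h):
    """Return kappa_h(x) = kappa(x / h) / h (Equation 14.71)."""
    return lambda x: kernel(x / h) / h

def moment(kernel, power, lo=-10.0, hi=10.0, steps=20000):
    """Midpoint-rule estimate of the integral of x**power * kernel(x) over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        total += (x ** power) * kernel(x)
    return total * dx
```

Calling `moment(kernel, 0)`, `moment(kernel, 1)` and `moment(kernel, 2)` approximates the total mass, the mean and the second moment of a kernel respectively.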


Kernel density estimation (KDE) 


Recall the Gaussian mixture model from Section 11.2.1. This is a parametric density estimator for 
data in $\mathbb{R}^D$. However, it requires specifying the number K and locations $\boldsymbol{\mu}_k$ of the clusters. An 
alternative to estimating the $\boldsymbol{\mu}_k$ is to allocate one cluster center per data point, so $\boldsymbol{\mu}_i = \mathbf{x}_i$. In 
this case, the model becomes 


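$$p(\mathbf{x}|\mathcal{D}) = \frac{1}{N}\sum_{i=1}^N \mathcal{N}(\mathbf{x}|\mathbf{x}_i, \sigma^2 \mathbf{I}) \qquad (14.77)$$

We can generalize the approach by writing 

$$\hat{p}(\mathbf{x}) = \frac{1}{N}\sum_{i=1}^N \kappa_h(\mathbf{x} - \mathbf{x}_i) \qquad (14.78)$$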


This is called a Parzen window density estimator, or kernel density estimator (KDE), and 
is a simple non-parametric density model. The advantage over a parametric model is that no 
model fitting is required (except for tuning the bandwidth, usually done by cross-validation), and 
there is no need to pick K. The disadvantage is that the model takes a lot of memory to store, 
and a lot of time to evaluate. It is also of no use for clustering tasks. 

Figure 14.17 illustrates KDE in 1d for two kinds of kernel. On the top, we use a boxcar kernel, 
$\kappa(x) = \mathbb{I}(-1 < x < 1)$. The result is equivalent to a histogram estimate of the density, since 
we just count how many data points land within an interval of size h around $x_i$. On the bottom, 
we use a Gaussian kernel, which results in a smoother fit. 
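
A Parzen window estimator in 1d takes only a few lines. In the sketch below, the function name `kde` and the six data values are illustrative assumptions, not the data used for Figure 14.17. Note that we scale the boxcar by 1/2 so that it integrates to one, which the indicator function alone does not.

```python
import math

def gaussian(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def boxcar(x):
    # Uniform density on [-1, 1]; the factor 0.5 makes it integrate to one.
    return 0.5 if abs(x) <= 1.0 else 0.0

def kde(x, data, h, kernel=gaussian):
    """Parzen window estimate (1/N) * sum_i kappa_h(x - x_i) in 1d (Equation 14.78)."""
    return sum(kernel((x - xi) / h) / h for xi in data) / len(data)

data = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]  # illustrative values only
density_at_zero = kde(0.0, data, h=1.0)
```

With the boxcar kernel, `kde(x, data, h, boxcar)` is proportional to the number of data points within distance h of x, which is the counting interpretation given above.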

The usual way to pick h is to minimize an estimate (such as cross validation) of the frequentist 
risk (see e.g., (Bowman and Azzalini 1997)). In Section 25.2, we discuss a Bayesian approach to 
non-parametric density estimation, based on Dirichlet process mixture models, which allows us 
Figure 14.17 A nonparametric (Parzen) density estimator in 1D estimated from 6 data points, denoted 
by x. Top row: uniform kernel (panels: unif, h=1.000; unif, h=2.000). Bottom row: Gaussian kernel 
(panels: gauss, h=1.000; gauss, h=2.000). Columns represent increasingly large bandwidth 
parameters. Based on http://en.wikipedia.org/wiki/Kernel_density_estimation. Figure 
generated by parzenWindowDemo2. 


to infer h. DP mixtures can also be more efficient than KDE, since they do not need to store 
all the data. See also Section 15.2.4 where we discuss an empirical Bayes approach to estimating 
kernel parameters in a Gaussian process model for classification/regression. 


From KDE to KNN 


We can use KDE to define the class conditional densities in a generative classifier. This turns 
out to provide an alternative derivation of the nearest neighbors classifier, which we introduced 
in Section 1.4.2. To show this, we follow the presentation of (Bishop 2006a, p125). In KDE 
with a boxcar kernel, we fixed the bandwidth and counted how many data points fall within the 
hyper-cube centered on a datapoint. Suppose that, instead of fixing the bandwidth h, we instead 
Figure 14.18 An example of kernel regression in 1d using a Gaussian kernel (panel title: Gaussian 
kernel regression; legend: true, data, estimate). Figure generated by 
kernelRegressionDemo, based on code by Yi Cao. 


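allow the bandwidth or volume to be different for each data point. Specifically, we will “grow” 
a volume around $\mathbf{x}$ until we encounter K data points, regardless of their class label. Let the 
resulting volume have size $V(\mathbf{x})$ (this was previously $h^D$), and let there be $N_c(\mathbf{x})$ examples 
from class c in this volume. Then we can estimate the class conditional density as follows: 

$$p(\mathbf{x}|y = c, \mathcal{D}) = \frac{N_c(\mathbf{x})}{N_c V(\mathbf{x})} \qquad (14.79)$$

where $N_c$ is the total number of examples in class c in the whole data set. The class prior can 
be estimated by 

$$p(y = c|\mathcal{D}) = \frac{N_c}{N} \qquad (14.80)$$

Hence the class posterior is given by 

$$p(y = c|\mathbf{x}, \mathcal{D}) = \frac{\frac{N_c(\mathbf{x})}{N_c V(\mathbf{x})}\frac{N_c}{N}}{\sum_{c'} \frac{N_{c'}(\mathbf{x})}{N_{c'} V(\mathbf{x})}\frac{N_{c'}}{N}} = \frac{N_c(\mathbf{x})}{\sum_{c'} N_{c'}(\mathbf{x})} = \frac{N_c(\mathbf{x})}{K} \qquad (14.81)$$

where we used the fact that $\sum_c N_c(\mathbf{x}) = K$, since we choose a total of K points (regardless of 
class) around every point. This is equivalent to Equation 1.2, since $N_c(\mathbf{x}) = \sum_{i \in N_K(\mathbf{x}, \mathcal{D})} \mathbb{I}(y_i = c)$. 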
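
The cancellation in Equation 14.81 can be seen directly in code. The sketch below, for 1d inputs, computes the posterior both as N_c(x)/K and the long way through Equations 14.79 and 14.80. The function names are our own, and the volume V(x) is taken to be the length of the interval that reaches the K-th neighbour, which assumes that neighbour does not coincide with x.

```python
from collections import Counter

def knn_posterior(x, data, labels, K):
    """Class posterior N_c(x) / K from the K nearest neighbours of x (1d inputs)."""
    order = sorted(range(len(data)), key=lambda i: abs(data[i] - x))
    local = Counter(labels[i] for i in order[:K])  # N_c(x) for each class c
    return {c: local[c] / K for c in set(labels)}

def knn_posterior_via_bayes(x, data, labels, K):
    """The same posterior, computed through Equations 14.79 to 14.81."""
    order = sorted(range(len(data)), key=lambda i: abs(data[i] - x))
    neighbours = order[:K]
    V = 2.0 * abs(data[neighbours[-1]] - x)  # interval length holding K points
    N = len(data)
    totals = Counter(labels)                         # N_c
    local = Counter(labels[i] for i in neighbours)   # N_c(x)
    joint = {c: (local[c] / (totals[c] * V)) * (totals[c] / N) for c in totals}
    Z = sum(joint.values())
    return {c: joint[c] / Z for c in joint}
```

Both functions should return the same dictionary of class probabilities, since V(x), N_c and N all cancel.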


Kernel regression 


In Section 14.7.2, we discussed the use of kernel density estimation or KDE for unsupervised 
learning. We can also use KDE for regression. The goal is to compute the conditional expectation 


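$$f(\mathbf{x}) = \mathbb{E}[y|\mathbf{x}] = \int y\, p(y|\mathbf{x})\,dy = \frac{\int y\, p(\mathbf{x}, y)\,dy}{\int p(\mathbf{x}, y)\,dy} \qquad (14.82)$$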
We can use KDE to approximate the joint density p(x, y) as follows: 

$$p(\mathbf{x}, y) \approx \frac{1}{N}\sum_{i=1}^N \kappa_h(\mathbf{x} - \mathbf{x}_i)\,\kappa_h(y - y_i) \qquad (14.83)$$


Hence 

$$f(\mathbf{x}) = \frac{\frac{1}{N}\sum_{i=1}^N \kappa_h(\mathbf{x} - \mathbf{x}_i) \int y\,\kappa_h(y - y_i)\,dy}{\frac{1}{N}\sum_{i'=1}^N \kappa_h(\mathbf{x} - \mathbf{x}_{i'}) \int \kappa_h(y - y_{i'})\,dy} \qquad (14.84)$$

$$= \frac{\sum_{i=1}^N \kappa_h(\mathbf{x} - \mathbf{x}_i)\, y_i}{\sum_{i'=1}^N \kappa_h(\mathbf{x} - \mathbf{x}_{i'})} \qquad (14.85)$$

To derive this result, we used two properties of smoothing kernels. First, that they integrate to 
one, i.e., $\int \kappa_h(y - y_i)\,dy = 1$. And second, the fact that $\int y\,\kappa_h(y - y_i)\,dy = y_i$. This follows by 
defining $x = y - y_i$ and using the zero mean property of smoothing kernels: 

$$\int (x + y_i)\,\kappa_h(x)\,dx = \int x\,\kappa_h(x)\,dx + y_i \int \kappa_h(x)\,dx = 0 + y_i = y_i \qquad (14.86)$$

We can rewrite the above result as follows: 

$$f(\mathbf{x}) = \sum_{i=1}^N w_i(\mathbf{x})\, y_i \qquad (14.87)$$

$$w_i(\mathbf{x}) \triangleq \frac{\kappa_h(\mathbf{x} - \mathbf{x}_i)}{\sum_{i'=1}^N \kappa_h(\mathbf{x} - \mathbf{x}_{i'})} \qquad (14.88)$$

We see that the prediction is just a weighted sum of the outputs at the training points, where 
the weights depend on how similar x is to the stored training points. This method is called 
kernel regression, kernel smoothing, or the Nadaraya-Watson model. See Figure 14.18 for an 
example, where we use a Gaussian kernel. 

Note that this method only has one free parameter, namely h. One can show (Bowman and 
Azzalini 1997) that for 1d data, if the true density is Gaussian and we are using Gaussian kernels, 
the optimal bandwidth h is given by 

$$h = \left(\frac{4}{3N}\right)^{1/5} \hat{\sigma} \qquad (14.89)$$

We can compute a robust approximation to the standard deviation by first computing the median 
absolute deviation 

$$\text{MAD} = \text{median}(|x - \text{median}(x)|) \qquad (14.90)$$

and then using 

$$\hat{\sigma} = 1.4826\ \text{MAD} = \frac{1}{0.6745}\ \text{MAD} \qquad (14.91)$$

The code used to produce Figure 14.18 estimated $h_x$ and $h_y$ separately, and then set $h = \sqrt{h_x h_y}$. 

Although these heuristics seem to work well, their derivation rests on some rather dubious 
assumptions (such as Gaussianity of the true density). Furthermore, these heuristics are limited 
to tuning just a single parameter. In Section 15.2.4 we discuss an empirical Bayes approach to 
estimating multiple kernel parameters in a Gaussian process model for classification/regression, 
which can handle many tuning parameters, and which is based on much more transparent 
principles (maximizing the marginal likelihood). 
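
To make the kernel regression recipe concrete, here is a minimal sketch in plain Python of the Nadaraya-Watson predictor (Equations 14.87 and 14.88) together with the bandwidth heuristic of Equations 14.89 to 14.91. It is not the code behind Figure 14.18; the function names are our own, and we assume 1d inputs and a Gaussian kernel.

```python
import math
import statistics

def gaussian(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def robust_sigma(values):
    """1.4826 * MAD, with MAD = median(|x - median(x)|) (Equations 14.90 and 14.91)."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return 1.4826 * mad

def rule_of_thumb_bandwidth(values):
    """h = (4 / (3N))^(1/5) * sigma_hat (Equation 14.89)."""
    return (4.0 / (3.0 * len(values))) ** 0.2 * robust_sigma(values)

def kernel_regression(x, xs, ys, h):
    """Nadaraya-Watson prediction sum_i w_i(x) y_i for 1d inputs."""
    weights = [gaussian((x - xi) / h) for xi in xs]  # the 1/h factors cancel
    total = sum(weights)  # can underflow to zero if x is very far from all xs
    return sum(w * y for w, y in zip(weights, ys)) / total
```

Following the description of the figure code, one could set `h = math.sqrt(rule_of_thumb_bandwidth(xs) * rule_of_thumb_bandwidth(ys))` before calling `kernel_regression`.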


Locally weighted regression 


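If we define $\kappa_h(\mathbf{x} - \mathbf{x}_i) = \kappa(\mathbf{x}, \mathbf{x}_i)$, we can rewrite the prediction made by kernel regression as 
follows 

$$\hat{f}(\mathbf{x}_*) = \sum_{i=1}^N y_i \frac{\kappa(\mathbf{x}_*, \mathbf{x}_i)}{\sum_{i'=1}^N \kappa(\mathbf{x}_*, \mathbf{x}_{i'})} \qquad (14.92)$$

Note that $\kappa(\mathbf{x}, \mathbf{x}_i)$ need not be a smoothing kernel. If it is not, we no longer need the 
normalization term, so we can just write 

$$\hat{f}(\mathbf{x}_*) = \sum_{i=1}^N y_i\, \kappa(\mathbf{x}_*, \mathbf{x}_i) \qquad (14.93)$$

This model is essentially fitting a constant function locally. We can improve on this by fitting a 
linear regression model for each point $\mathbf{x}_*$ by solving 

$$\min_{\boldsymbol{\beta}(\mathbf{x}_*)} \sum_{i=1}^N \kappa(\mathbf{x}_*, \mathbf{x}_i)\,[y_i - \boldsymbol{\beta}(\mathbf{x}_*)^T \boldsymbol{\phi}(\mathbf{x}_i)]^2 \qquad (14.94)$$

where $\boldsymbol{\phi}(\mathbf{x}) = [1, \mathbf{x}]$. This is called locally weighted regression. An example of such a method 
is LOESS, aka LOWESS, which stands for “locally-weighted scatterplot smoothing” (Cleveland 
and Devlin 1988). See also (Edakunni et al. 2010) for a Bayesian version of this model. 

We can compute the parameters $\boldsymbol{\beta}(\mathbf{x}_*)$ for each test case by solving the following weighted 
least squares problem: 

$$\boldsymbol{\beta}(\mathbf{x}_*) = (\boldsymbol{\Phi}^T \mathbf{D}(\mathbf{x}_*) \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{D}(\mathbf{x}_*)\, \mathbf{y} \qquad (14.95)$$

where $\boldsymbol{\Phi}$ is an N × (D + 1) design matrix and $\mathbf{D} = \text{diag}(\kappa(\mathbf{x}_*, \mathbf{x}_i))$. The corresponding 
prediction has the form 

$$\hat{f}(\mathbf{x}_*) = \boldsymbol{\phi}(\mathbf{x}_*)^T \boldsymbol{\beta}(\mathbf{x}_*) = \boldsymbol{\phi}(\mathbf{x}_*)^T (\boldsymbol{\Phi}^T \mathbf{D}(\mathbf{x}_*) \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{D}(\mathbf{x}_*)\, \mathbf{y} = \sum_{i=1}^N w_i(\mathbf{x}_*)\, y_i \qquad (14.96)$$

The term $w_i(\mathbf{x}_*)$, which combines the local smoothing kernel with the effect of linear regression, 
is called the equivalent kernel. See also Section 15.4.2. 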
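
For scalar inputs, the weighted least squares problem in Equation 14.95 reduces to a 2 × 2 linear system, which can be solved in closed form. The sketch below is one way to do this in plain Python; it is not LOESS itself (which uses a tri-cube weight on a nearest-neighbour window), and the Gaussian weight function and the function names are our own assumptions.

```python
import math

def gaussian_weight(x_star, x_i, h):
    """Unnormalized weight kappa(x_star, x_i); any non-negative kernel would do."""
    return math.exp(-0.5 * ((x_star - x_i) / h) ** 2)

def locally_weighted_fit(x_star, xs, ys, h):
    """Return beta(x_star) = (Phi^T D Phi)^{-1} Phi^T D y for phi(x) = [1, x]."""
    w = [gaussian_weight(x_star, xi, h) for xi in xs]
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, xs))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, xs))
    t0 = sum(wi * yi for wi, yi in zip(w, ys))
    t1 = sum(wi * xi * yi for wi, xi, yi in zip(w, xs, ys))
    det = s0 * s2 - s1 * s1  # determinant of Phi^T D Phi
    intercept = (s2 * t0 - s1 * t1) / det
    slope = (s0 * t1 - s1 * t0) / det
    return intercept, slope

def locally_weighted_predict(x_star, xs, ys, h):
    """Prediction phi(x_star)^T beta(x_star) (Equation 14.96)."""
    b0, b1 = locally_weighted_fit(x_star, xs, ys, h)
    return b0 + b1 * x_star
```

When the data lie exactly on a line, the local fit recovers that line at every test point, whatever the bandwidth.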


Exercises 


Exercise 14.1 Fitting an SVM classifier by hand 
(Source: Jaakkola.) Consider a dataset with 2 points in 1d: $(x_1 = 0, y_1 = -1)$ and $(x_2 = \sqrt{2}, y_2 = 1)$. 
Consider mapping each point to 3d using the feature vector $\boldsymbol{\phi}(x) = [1, \sqrt{2}x, x^2]^T$. (This is equivalent to 
using a second order polynomial kernel.) The max margin classifier has the form 

$$\min ||\mathbf{w}||^2 \quad \text{s.t.} \qquad (14.97)$$
$$y_1(\mathbf{w}^T \boldsymbol{\phi}(x_1) + w_0) \ge 1 \qquad (14.98)$$
$$y_2(\mathbf{w}^T \boldsymbol{\phi}(x_2) + w_0) \ge 1 \qquad (14.99)$$


a. Write down a vector that is parallel to the optimal vector w. Hint: recall from Figure 7.8 (12Apr10 
version) that w is perpendicular to the decision boundary between the two points in the 3d feature 
space. 


b. What is the value of the margin that is achieved by this w? Hint: recall that the margin is the distance 
from each support vector to the decision boundary. Hint 2: think about the geometry of 2 points in 
space, with a line separating one from the other. 


c. Solve for w, using the fact that the margin is equal to $1/||\mathbf{w}||$. 


d. Solve for wo using your value for w and Equations 14.97 to 14.99. Hint: the points will be on the 
decision boundary, so the inequalities will be tight. 


e. Write down the form of the discriminant function $f(x) = w_0 + \mathbf{w}^T \boldsymbol{\phi}(x)$ as an explicit function of x. 


Exercise 14.2 Linear separability 


(Source: Koller.) Consider fitting an SVM with C > 0 to a dataset that is linearly separable. Is the resulting 
decision boundary guaranteed to separate the classes? 


Gaussian processes 


15.1 Introduction 


In supervised learning, we observe some inputs $\mathbf{x}_i$ and some outputs $y_i$. We assume that 
$y_i = f(\mathbf{x}_i)$, for some unknown function f, possibly corrupted by noise. The optimal approach 
is to infer a distribution over functions given the data, p(f|X,y), and then to use this to make 
predictions given new inputs, i.e., to compute 

$$p(y_*|\mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \int p(y_*|f, \mathbf{x}_*)\, p(f|\mathbf{X}, \mathbf{y})\, df \qquad (15.1)$$


Up until now, we have focussed on parametric representations for the function f, so that 
instead of inferring p(f|D), we infer $p(\boldsymbol{\theta}|\mathcal{D})$. In this chapter, we discuss a way to perform 
Bayesian inference over functions themselves. 

Our approach will be based on Gaussian processes or GPs. A GP defines a prior over 
functions, which can be converted into a posterior over functions once we have seen some data. 
Although it might seem difficult to represent a distribution over a function, it turns out that we 
only need to be able to define a distribution over the function’s values at a finite, but arbitrary, 
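set of points, say $\mathbf{x}_1, \ldots, \mathbf{x}_N$. A GP assumes that $p(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_N))$ is jointly Gaussian, with 
some mean $\mu(\mathbf{x})$ and covariance $\Sigma(\mathbf{x})$ given by $\Sigma_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, where $\kappa$ is a positive definite 
kernel function (see Section 14.2 for information on kernels). The key idea is that if $\mathbf{x}_i$ and $\mathbf{x}_j$ are 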
deemed by the kernel to be similar, then we expect the output of the function at those points 
to be similar, too. See Figure 15.1 for an illustration. 

It turns out that, in the regression setting, all these computations can be done in closed form, 
in $O(N^3)$ time. (We discuss faster approximations in Section 15.6.) In the classification setting, 
we must use approximations, such as the Gaussian approximation, since the posterior is no 
longer exactly Gaussian. 

GPs can be thought of as a Bayesian alternative to the kernel methods we discussed in Chapter 14, 
including L1VM, RVM and SVM. Although those methods are sparser and therefore faster, 
they do not give well-calibrated probabilistic outputs (see Section 15.4.4 for further discussion). 
Having properly tuned probabilistic output is important in certain applications, such as online 
tracking for vision and robotics (Ko and Fox 2009), reinforcement learning and optimal control 
(Engel et al. 2005; Deisenroth et al. 2009), global optimization of non-convex functions (Mockus 
et al. 1996; Lizotte 2008; Brochu et al. 2009), experiment design (Santner et al. 2003), etc. 
Figure 15.1 A Gaussian process for 2 training points and 1 testing point, represented as a mixed directed 
and undirected graphical model representing $p(\mathbf{y}, \mathbf{f}|\mathbf{x}) = \mathcal{N}(\mathbf{f}|\mathbf{0}, \mathbf{K}(\mathbf{x})) \prod_i p(y_i|f_i)$. The hidden nodes 
$f_i = f(\mathbf{x}_i)$ represent the value of the function at each of the data points. These hidden nodes are fully 
interconnected by undirected edges, forming a Gaussian graphical model; the edge strengths represent the 
covariance terms $\Sigma_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$. If the test point $\mathbf{x}_*$ is similar to the training points $\mathbf{x}_1$ and $\mathbf{x}_2$, then 
the predicted output $y_*$ will be similar to $y_1$ and $y_2$. 


Our presentation is closely based on (Rasmussen and Williams 2006), which should be consulted 
for further details. See also (Diggle and Ribeiro 2007), which discusses the related approach 
known as kriging, which is widely used in the spatial statistics literature. 


GPs for regression 


In this section, we discuss GPs for regression. Let the prior on the regression function be a GP, 
denoted by 


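$$f(\mathbf{x}) \sim \text{GP}(m(\mathbf{x}), \kappa(\mathbf{x}, \mathbf{x}')) \qquad (15.2)$$

where $m(\mathbf{x})$ is the mean function and $\kappa(\mathbf{x}, \mathbf{x}')$ is the kernel or covariance function, i.e., 

$$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})] \qquad (15.3)$$

$$\kappa(\mathbf{x}, \mathbf{x}') = \mathbb{E}\left[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))^T\right] \qquad (15.4)$$

We obviously require that $\kappa(\cdot, \cdot)$ be a positive definite kernel. For any finite set of points, this 
process defines a joint Gaussian: 

$$p(\mathbf{f}|\mathbf{X}) = \mathcal{N}(\mathbf{f}|\boldsymbol{\mu}, \mathbf{K}) \qquad (15.5)$$

where $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and $\boldsymbol{\mu} = (m(\mathbf{x}_1), \ldots, m(\mathbf{x}_N))$. 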

Note that it is common to use a mean function of m(x) = 0, since the GP is flexible enough 
to model the mean arbitrarily well, as we will see below. However, in Section 15.2.6 we will 
consider parametric models for the mean function, so the GP just has to model the residual 
errors. This semi-parametric approach combines the interpretability of parametric models with 
the accuracy of non-parametric models. 
Figure 15.2 Left: some functions sampled from a GP prior with SE kernel. Right: some samples from a GP 
posterior, after conditioning on 5 noise-free observations. The shaded area represents $\mathbb{E}[f(\mathbf{x})] \pm 2\,\text{std}(f(\mathbf{x}))$. 
Based on Figure 2.2 of (Rasmussen and Williams 2006). Figure generated by gprDemoNoiseFree. 


Predictions using noise-free observations 


Suppose we observe a training set $\mathcal{D} = \{(\mathbf{x}_i, f_i), i = 1:N\}$, where $f_i = f(\mathbf{x}_i)$ is the noise-free 
observation of the function evaluated at $\mathbf{x}_i$. Given a test set $\mathbf{X}_*$ of size $N_* \times D$, we want to 
predict the function outputs $\mathbf{f}_*$. 

If we ask the GP to predict f(x) for a value of x that it has already seen, we want the GP to 
return the answer f(x) with no uncertainty. In other words, it should act as an interpolator 
of the training data. This will only happen if we assume the observations are noiseless. We will 
consider the case of noisy observations below. 

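Now we return to the prediction problem. By definition of the GP, the joint distribution has 
the following form 

$$\begin{pmatrix} \mathbf{f} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \boldsymbol{\mu} \\ \boldsymbol{\mu}_* \end{pmatrix}, \begin{pmatrix} \mathbf{K} & \mathbf{K}_* \\ \mathbf{K}_*^T & \mathbf{K}_{**} \end{pmatrix}\right) \qquad (15.6)$$

where $\mathbf{K} = \kappa(\mathbf{X}, \mathbf{X})$ is $N \times N$, $\mathbf{K}_* = \kappa(\mathbf{X}, \mathbf{X}_*)$ is $N \times N_*$, and $\mathbf{K}_{**} = \kappa(\mathbf{X}_*, \mathbf{X}_*)$ is $N_* \times N_*$. 
By the standard rules for conditioning Gaussians (Section 4.3), the posterior has the following 
form 

$$p(\mathbf{f}_*|\mathbf{X}_*, \mathbf{X}, \mathbf{f}) = \mathcal{N}(\mathbf{f}_*|\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*) \qquad (15.7)$$
$$\boldsymbol{\mu}_* = \boldsymbol{\mu}(\mathbf{X}_*) + \mathbf{K}_*^T \mathbf{K}^{-1}(\mathbf{f} - \boldsymbol{\mu}(\mathbf{X})) \qquad (15.8)$$
$$\boldsymbol{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_*^T \mathbf{K}^{-1} \mathbf{K}_* \qquad (15.9)$$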

This process is illustrated in Figure 15.2. On the left we show some samples from the prior, 
p(f|X), where we use a squared exponential kernel, aka Gaussian kernel or RBF kernel. In 
1d, this is given by 

$$\kappa(x, x') = \sigma_f^2 \exp\left(-\frac{1}{2\ell^2}(x - x')^2\right) \qquad (15.10)$$

Here $\ell$ controls the horizontal length scale over which the function varies, and $\sigma_f^2$ controls the 
vertical variation. (We discuss how to estimate such kernel parameters below.) On the right we 
show samples from the posterior, $p(\mathbf{f}_*|\mathbf{X}_*, \mathbf{X}, \mathbf{f})$. We see that the model perfectly interpolates 
the training data, and that the predictive uncertainty increases as we move further away from 
the observed data. 

One application of noise-free GP regression is as a computationally cheap proxy for the 
behavior of a complex simulator, such as a weather forecasting program. (If the simulator is 
stochastic, we can define f to be its mean output; note that there is still no observation noise.) 
One can then estimate the effect of changing simulator parameters by examining their effect 
on the GP’s predictions, rather than having to run the simulator many times, which may be 
prohibitively slow. This strategy is known as DACE, which stands for design and analysis of 
computer experiments (Santner et al. 2003). 


Predictions using noisy observations 


Now let us consider the case where what we observe is a noisy version of the underlying 
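function, $y = f(\mathbf{x}) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma_y^2)$. In this case, the model is not required to 
interpolate the data, but it must come “close” to the observed data. The covariance of the 
observed noisy responses is 

$$\text{cov}[y_p, y_q] = \kappa(\mathbf{x}_p, \mathbf{x}_q) + \sigma_y^2 \delta_{pq} \qquad (15.11)$$

where $\delta_{pq} = \mathbb{I}(p = q)$. In other words 

$$\text{cov}[\mathbf{y}|\mathbf{X}] = \mathbf{K} + \sigma_y^2 \mathbf{I}_N \triangleq \mathbf{K}_y \qquad (15.12)$$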


The second matrix is diagonal because we assumed the noise terms were independently added 
to each observation. 

The joint density of the observed data and the latent, noise-free function on the test points 
is given by 


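$$\begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{pmatrix} \mathbf{K}_y & \mathbf{K}_* \\ \mathbf{K}_*^T & \mathbf{K}_{**} \end{pmatrix}\right) \qquad (15.13)$$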


where we are assuming the mean is zero, for notational simplicity. Hence the posterior predictive 
density is 


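$$p(\mathbf{f}_*|\mathbf{X}_*, \mathbf{X}, \mathbf{y}) = \mathcal{N}(\mathbf{f}_*|\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*) \qquad (15.14)$$
$$\boldsymbol{\mu}_* = \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{y} \qquad (15.15)$$
$$\boldsymbol{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{K}_* \qquad (15.16)$$

In the case of a single test input, this simplifies as follows 

$$p(f_*|\mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \mathcal{N}(f_*|\mathbf{k}_*^T \mathbf{K}_y^{-1} \mathbf{y},\ k_{**} - \mathbf{k}_*^T \mathbf{K}_y^{-1} \mathbf{k}_*) \qquad (15.17)$$

where $\mathbf{k}_* = [\kappa(\mathbf{x}_*, \mathbf{x}_1), \ldots, \kappa(\mathbf{x}_*, \mathbf{x}_N)]$ and $k_{**} = \kappa(\mathbf{x}_*, \mathbf{x}_*)$. Another way to write the 
posterior mean is as follows: 

$$\bar{f}_* = \mathbf{k}_*^T \mathbf{K}_y^{-1} \mathbf{y} = \sum_{i=1}^N \alpha_i\, \kappa(\mathbf{x}_i, \mathbf{x}_*) \qquad (15.18)$$

where $\boldsymbol{\alpha} = \mathbf{K}_y^{-1} \mathbf{y}$. We will revisit this expression later. 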
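
The predictive equations for a single test input can be implemented with a Cholesky factorization of K_y, which avoids forming the inverse explicitly. The sketch below is a plain Python illustration for 1d inputs with a zero prior mean and the SE kernel of Equation 15.10; it is not the code used to generate the figures, and the function names and default hyper-parameter values are our own assumptions.

```python
import math

def se_kernel(a, b, ell=1.0, sigma_f=1.0):
    """Squared exponential kernel for scalar inputs (Equation 15.10)."""
    return sigma_f ** 2 * math.exp(-0.5 * (a - b) ** 2 / ell ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_spd(L, b):
    """Solve (L L^T) x = b by forward and then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(x_star, xs, ys, sigma_y=0.1, ell=1.0, sigma_f=1.0):
    """Posterior mean and variance of f(x_star) as in Equation 15.17."""
    n = len(xs)
    K_y = [[se_kernel(xs[i], xs[j], ell, sigma_f) + (sigma_y ** 2 if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    L = cholesky(K_y)
    k_star = [se_kernel(xi, x_star, ell, sigma_f) for xi in xs]
    alpha = solve_spd(L, ys)  # alpha = K_y^{-1} y
    mean = sum(a * k for a, k in zip(alpha, k_star))  # Equation 15.18
    v = solve_spd(L, k_star)  # K_y^{-1} k_*
    var = se_kernel(x_star, x_star, ell, sigma_f) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, var
```

As the noise level `sigma_y` shrinks, the posterior mean at a training input approaches the observed value, recovering the interpolating behaviour of the noise-free case.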
Figure 15.3 Some 1d GPs with SE kernels but different hyper-parameters fit to 20 noisy observations. The 
kernel has the form in Equation 15.19. The hyper-parameters $(\ell, \sigma_f, \sigma_y)$ are as follows: (a) (1, 1, 0.1), (b) (0.3, 
1.08, 0.00005), (c) (3.0, 1.16, 0.89). Based on Figure 2.5 of (Rasmussen and Williams 2006). Figure generated 
by gprDemoChangeHparams, written by Carl Rasmussen. 


Effect of the kernel parameters 


The predictive performance of GPs depends exclusively on the suitability of the chosen kernel. 
Suppose we choose the following squared-exponential (SE) kernel for the noisy observations 


$$\kappa_y(x_p, x_q) = \sigma_f^2 \exp\left(-\frac{1}{2\ell^2}(x_p - x_q)^2\right) + \sigma_y^2 \delta_{pq} \qquad (15.19)$$


Here $\ell$ is the horizontal scale over which the function changes, $\sigma_f^2$ controls the vertical scale of 
the function, and $\sigma_y^2$ is the noise variance. Figure 15.3 illustrates the effects of changing these 
parameters. We sampled 20 noisy data points from the SE kernel using $(\ell, \sigma_f, \sigma_y) = (1, 1, 0.1)$, 
and then made predictions using various parameters, conditional on the data. In Figure 15.3(a), we use 
$(\ell, \sigma_f, \sigma_y) = (1, 1, 0.1)$, and the result is a good fit. In Figure 15.3(b), we reduce the length scale 
to $\ell = 0.3$ (the other parameters were optimized by maximum (marginal) likelihood, a technique 
we discuss below); now the function looks more “wiggly”. Also, the uncertainty goes up faster, 
since the effective distance from the training points increases more rapidly. In Figure 15.3(c), we 
increase the length scale to $\ell = 3$; now the function looks smoother. 
Figure 15.4 Some 2d functions sampled from a GP with an SE kernel but different hyper-parameters. The 
kernel has the form in Equation 15.20 where (a) M = I, (b) M = diag(1, 3)$^{-2}$, (c) M = (1, -1; -1, 1) + 
diag(6, 6)$^{-2}$. Based on Figure 5.1 of (Rasmussen and Williams 2006). Figure generated by gprDemoArd, 
written by Carl Rasmussen. 


We can extend the SE kernel to multiple dimensions as follows: 


$$\kappa_y(\mathbf{x}_p, \mathbf{x}_q) = \sigma_f^2 \exp\left(-\frac{1}{2}(\mathbf{x}_p - \mathbf{x}_q)^T \mathbf{M} (\mathbf{x}_p - \mathbf{x}_q)\right) + \sigma_y^2 \delta_{pq} \qquad (15.20)$$


We can define the matrix M in several ways. The simplest is to use an isotropic matrix, 
$\mathbf{M}_1 = \ell^{-2}\mathbf{I}$. See Figure 15.4(a) for an example. We can also endow each dimension with its 
own characteristic length scale, $\mathbf{M}_2 = \text{diag}(\boldsymbol{\ell})^{-2}$. If any of these length scales become large, 
the corresponding feature dimension is deemed “irrelevant”, just as in ARD (Section 13.7). In 
Figure 15.4(b), we use $\mathbf{M} = \mathbf{M}_2$ with $\boldsymbol{\ell} = (1, 3)$, so the function changes faster along the $x_1$ 
direction than the $x_2$ direction. 

We can also create a matrix of the form $\mathbf{M}_3 = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^T + \text{diag}(\boldsymbol{\ell})^{-2}$, where $\boldsymbol{\Lambda}$ is a D × K matrix, 
where K < D. (Rasmussen and Williams 2006, p107) calls this the factor analysis distance 
function, by analogy to the fact that factor analysis (Section 12.1) approximates a covariance 
matrix as a low rank matrix plus a diagonal matrix. The columns of $\boldsymbol{\Lambda}$ correspond to relevant 
directions in input space. In Figure 15.4(c), we use $\boldsymbol{\ell} = (6; 6)$ and $\boldsymbol{\Lambda} = (1; -1)$, so the function 
changes most rapidly in the direction which is perpendicular to (1,1). 
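
The three choices of M can be compared by evaluating the noise-free part of Equation 15.20 along different directions. The following plain Python sketch builds each matrix as a list of rows; the function names `isotropic`, `ard` and `factor_analysis` are our own labels for the three constructions.

```python
import math

def se_kernel_md(xp, xq, M, sigma_f=1.0):
    """sigma_f^2 * exp(-0.5 (xp - xq)^T M (xp - xq)): Equation 15.20 without the noise term."""
    d = [a - b for a, b in zip(xp, xq)]
    n = len(d)
    quad = sum(d[i] * M[i][j] * d[j] for i in range(n) for j in range(n))
    return sigma_f ** 2 * math.exp(-0.5 * quad)

def isotropic(ell, D):
    """M1 = ell^-2 I."""
    return [[ell ** -2 if i == j else 0.0 for j in range(D)] for i in range(D)]

def ard(ells):
    """M2 = diag(ells)^-2, one length scale per input dimension."""
    D = len(ells)
    return [[ells[i] ** -2 if i == j else 0.0 for j in range(D)] for i in range(D)]

def factor_analysis(Lam, ells):
    """M3 = Lam Lam^T + diag(ells)^-2, with Lam given as D rows of length K."""
    D, K = len(ells), len(Lam[0])
    return [[sum(Lam[i][k] * Lam[j][k] for k in range(K))
             + (ells[i] ** -2 if i == j else 0.0) for j in range(D)] for i in range(D)]
```

For instance, with `factor_analysis([[1.0], [-1.0]], [6.0, 6.0])` the kernel decays much faster for a step along (1, -1) than for a step of the same length along (1, 1), in line with the description of Figure 15.4(c).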
15.2.4 Estimating the kernel parameters 


To estimate the kernel parameters, we could use exhaustive search over a discrete grid of values, 
with validation loss as an objective, but this can be quite slow. (This is the approach used to 
tune kernels used by SVMs.) Here we consider an empirical Bayes approach, which will allow us 
to use continuous optimization methods, which are much faster. In particular, we will maximize 
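the marginal likelihood¹ 


$$p(\mathbf{y}|\mathbf{X}) = \int p(\mathbf{y}|\mathbf{f}, \mathbf{X})\, p(\mathbf{f}|\mathbf{X})\, d\mathbf{f} \qquad (15.21)$$

Since $p(\mathbf{f}|\mathbf{X}) = \mathcal{N}(\mathbf{f}|\mathbf{0}, \mathbf{K})$, and $p(\mathbf{y}|\mathbf{f}) = \prod_i \mathcal{N}(y_i|f_i, \sigma_y^2)$, the marginal likelihood is given by 


$$\log p(\mathbf{y}|\mathbf{X}) = \log \mathcal{N}(\mathbf{y}|\mathbf{0}, \mathbf{K}_y) = -\frac{1}{2}\mathbf{y}^T \mathbf{K}_y^{-1} \mathbf{y} - \frac{1}{2}\log|\mathbf{K}_y| - \frac{N}{2}\log(2\pi) \qquad (15.22)$$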


The first term is a data fit term, the second term is a model complexity term, and the third term
is just a constant. To understand the tradeoff between the first two terms, consider a SE kernel
in 1D, as we vary the length scale $\ell$ and hold $\sigma_y^2$ fixed. Let $J(\ell) = -\log p(\mathbf{y}|\mathbf{X}, \ell)$. For short
length scales, the fit will be good, so $\mathbf{y}^T \mathbf{K}_y^{-1} \mathbf{y}$ will be small. However, the model complexity
will be high: $\mathbf{K}$ will be almost diagonal (as in Figure 14.3, top right), since most points will not
be considered “near” any others, so $\log|\mathbf{K}_y|$ will be large. For long length scales, the fit will
be poor but the model complexity will be low: $\mathbf{K}$ will be almost all 1’s (as in Figure 14.3, bottom
right), so $\log|\mathbf{K}_y|$ will be small.

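We now discuss how to maximize the marginal likelihood. Let the kernel parameters (also
called hyper-parameters) be denoted by $\boldsymbol{\theta}$. One can show that

$$\frac{\partial}{\partial \theta_j} \log p(\mathbf{y}|\mathbf{X}) = \frac{1}{2}\mathbf{y}^T \mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \mathbf{K}_y^{-1} \mathbf{y} - \frac{1}{2}\mathrm{tr}\left(\mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j}\right) \tag{15.23}$$

$$= \frac{1}{2}\mathrm{tr}\left((\boldsymbol{\alpha}\boldsymbol{\alpha}^T - \mathbf{K}_y^{-1}) \frac{\partial \mathbf{K}_y}{\partial \theta_j}\right) \tag{15.24}$$

where $\boldsymbol{\alpha} = \mathbf{K}_y^{-1}\mathbf{y}$. It takes $O(N^3)$ time to compute $\mathbf{K}_y^{-1}$, and then $O(N^2)$ time per hyper-parameter
to compute the gradient.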

The form of $\partial \mathbf{K}_y / \partial \theta_j$ depends on the form of the kernel, and which parameter we are taking
derivatives with respect to. Often we have constraints on the hyper-parameters, such as $\sigma_y^2 \geq 0$.
In this case, we can define $\theta = \log(\sigma_y^2)$, and then use the chain rule.

Given an expression for the log marginal likelihood and its derivative, we can estimate the
kernel parameters using any standard gradient-based optimizer. However, since the objective is
not convex, local optima can be a problem, as we illustrate below.
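
As an illustration of Equations 15.22 and 15.24, here is a minimal NumPy sketch for a 1D SE kernel with $\sigma_f^2 = 1$, differentiating with respect to $\log \ell$ so that the positivity constraint is handled by the chain rule. The function names and the default noise level are assumptions for the example, not part of any library.

```python
import numpy as np

def se_kernel_1d(x, ell):
    """SE kernel matrix (sigma_f = 1) and the squared distances it is built from."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell**2), d2

def log_marglik_and_grad(x, y, log_ell, sigma_y=0.1):
    """Return log p(y|X) (Eq. 15.22) and its derivative w.r.t. log(ell) (Eq. 15.24)."""
    N = len(x)
    ell = np.exp(log_ell)
    K, d2 = se_kernel_1d(x, ell)
    Ky = K + sigma_y**2 * np.eye(N)
    L = np.linalg.cholesky(Ky)                              # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # alpha = Ky^{-1} y
    # 0.5 * log|Ky| equals the sum of the log-diagonal of L.
    logp = -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * N * np.log(2 * np.pi)
    # dK/d(log ell) = ell * dK/d(ell) = K * d2 / ell^2 (chain rule).
    dK = K * d2 / ell**2
    Ky_inv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(N)))
    grad = 0.5 * np.trace((np.outer(alpha, alpha) - Ky_inv) @ dK)
    return logp, grad
```

The pair (logp, grad) can be handed to any gradient-based optimizer (after negating both for a minimizer); restarting from several initial values of log_ell is one simple way to probe for multiple local optima.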


15.2.4.1 Example 


Consider Figure 15.5. We use the SE kernel in Equation 15.19 with $\sigma_f^2 = 1$, and plot $\log p(\mathbf{y}|\mathbf{X}, \ell, \sigma_y^2)$
(where $\mathbf{X}$ and $\mathbf{y}$ are the 7 data points shown in panels b and c) as we vary $\ell$ and $\sigma_y^2$. The two


1. The reason it is called the marginal likelihood, rather than just likelihood, is because we have marginalized out the 
latent Gaussian vector f. This moves us up one level of the Bayesian hierarchy, and reduces the chances of overfitting 
(the number of kernel parameters is usually fairly small compared to a standard parametric model). 
Figure 15.5 Illustration of local minima in the marginal likelihood surface. (a) We plot the log marginal
likelihood vs $\sigma_y^2$ and $\ell$, for fixed $\sigma_f^2 = 1$, using the 7 data points shown in panels b and c. (b) The function
corresponding to the lower left local minimum, $(\ell, \sigma_y^2) \approx (1, 0.2)$. This is quite “wiggly” and has low
noise. (c) The function corresponding to the top right local minimum, $(\ell, \sigma_y^2) \approx (10, 0.8)$. This is quite
smooth and has high noise. The data was generated using $(\ell, \sigma_y^2) = (1, 0.1)$. Source: Figure 5.5 of
(Rasmussen and Williams 2006). Figure generated by gprDemoMarglik, written by Carl Rasmussen.


local optima are indicated by +. The bottom left optimum corresponds to a low-noise, short- 
length scale solution (shown in panel b). The top right optimum corresponds to a high-noise, 
long-length scale solution (shown in panel c). With only 7 data points, there is not enough 
evidence to confidently decide which is more reasonable, although the more complex model 
(panel b) has a marginal likelihood that is about 60% higher than the simpler model (panel c). 
With more data, the MAP estimate should come to dominate. 
Figure 15.5 illustrates some other interesting (and typical) features. The region where $\sigma_y^2 \approx 1$
(top of panel a) corresponds to the case where the noise is very high; in this regime, the marginal
likelihood is insensitive to the length scale (indicated by the horizontal contours), since all the
data is explained as noise. The region where $\ell \approx 0.5$ (left hand side of panel a) corresponds to
the case where the length scale is very short; in this regime, the marginal likelihood is insensitive
to the noise level, since the data is perfectly interpolated. Neither of these regions would be
chosen by a good optimizer.


Figure 15.6 Three different approximations to the posterior over hyper-parameters: grid-based, Monte
Carlo, and central composite design. Source: Figure 3.2 of (Vanhatalo 2010). Used with kind permission 
of Jarno Vanhatalo. 


Bayesian inference for the hyper-parameters 


An alternative to computing a point estimate of the hyper-parameters is to compute their posterior.
Let $\boldsymbol{\theta}$ represent all the kernel parameters, as well as $\sigma_y^2$. If the dimensionality of $\boldsymbol{\theta}$ is small,
we can compute a discrete grid of possible values, centered on the MAP estimate $\hat{\boldsymbol{\theta}}$ (computed
as above). We can then approximate the posterior over the latent variables using

$$p(\mathbf{f}|\mathcal{D}) \propto \sum_{s=1}^{S} p(\mathbf{f}|\mathcal{D}, \boldsymbol{\theta}_s)\, p(\boldsymbol{\theta}_s|\mathcal{D})\, \delta_s \tag{15.25}$$

where $\delta_s$ denotes the weight for grid point $s$.

In higher dimensions, a regular grid suffers from the curse of dimensionality. An obvious 
alternative is Monte Carlo, but this can be slow. Another approach is to use a form of quasi- 
Monte Carlo, whereby we place grid points at the mode, and at a distance ±1 sd from the mode
along each dimension, for a total of $2|\boldsymbol{\theta}| + 1$ points. This is called a central composite design
(Rue et al. 2009). (This is also used in the unscented Kalman filter, see Section 18.5.2.) To make 
this Gaussian-like approximation more reasonable, we often log-transform the hyper-parameters. 
See Figure 15.6 for an illustration. 
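
The point set described above (the mode plus one point at ±1 sd along each axis) is easy to construct. The sketch below is only an illustration of that construction; the function name star_points and the example values of the mode and standard deviations (in log space) are hypothetical.

```python
import numpy as np

def star_points(mode, sd):
    """Return the mode plus points at +/- 1 sd along each axis: 2*len(mode) + 1 rows."""
    mode = np.asarray(mode, dtype=float)
    sd = np.asarray(sd, dtype=float)
    offsets = np.diag(sd)                       # row i moves only along dimension i
    return np.vstack([mode, mode + offsets, mode - offsets])

# Hypothetical example with theta = (log ell, log sigma_y^2).
points = star_points(mode=[0.0, -2.0], sd=[0.5, 1.0])
```

Each row of points is a hyper-parameter setting $\boldsymbol{\theta}_s$ at which the conditional posterior in Equation 15.25 would be evaluated; how the weights $\delta_s$ are chosen is not covered by this sketch.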
Multiple kernel learning


A quite different approach to optimizing kernel parameters is known as multiple kernel learning.
The idea is to define the kernel as a weighted sum of base kernels, $\kappa(\mathbf{x}, \mathbf{x}') = \sum_j w_j \kappa_j(\mathbf{x}, \mathbf{x}')$,
and then to optimize the weights $w_j$ instead of the kernel parameters themselves. This is
particularly useful if we have different kinds of data which we wish to fuse together. See
e.g., (Rakotomamonjy et al. 2008) for an approach based on risk-minimization and convex
optimization, and (Girolami and Rogers 2005) for an approach based on variational Bayes.


Computational and numerical issues * 


The predictive mean is given by $\bar{f}_* = \mathbf{k}_*^T \mathbf{K}_y^{-1} \mathbf{y}$. For reasons of numerical stability, it is unwise
to directly invert $\mathbf{K}_y$. A more robust alternative is to compute a Cholesky decomposition,
$\mathbf{K}_y = \mathbf{L}\mathbf{L}^T$. We can then compute the predictive mean and variance, and the log marginal
likelihood, as shown in the pseudo-code in Algorithm 15.1 (based on (Rasmussen and Williams
2006, p19)). It takes $O(N^3)$ time to compute the Cholesky decomposition, and $O(N^2)$ time to
solve for $\boldsymbol{\alpha} = \mathbf{K}_y^{-1}\mathbf{y} = \mathbf{L}^{-T}\mathbf{L}^{-1}\mathbf{y}$. We can then compute the mean using $\mathbf{k}_*^T \boldsymbol{\alpha}$ in $O(N)$ time
and the variance using $k_{**} - \mathbf{k}_*^T \mathbf{L}^{-T}\mathbf{L}^{-1}\mathbf{k}_*$ in $O(N^2)$ time for each test case.

An alternative to Cholesky decomposition is to solve the linear system $\mathbf{K}_y \boldsymbol{\alpha} = \mathbf{y}$ using
conjugate gradients (CG). If we terminate this algorithm after $k$ iterations, it takes $O(kN^2)$ time.
If we run for $k = N$, it gives the exact solution in $O(N^3)$ time. Another approach is to
approximate the matrix-vector multiplies needed by CG using the fast Gauss transform (Yang
et al. 2005); however, this doesn’t scale to high-dimensional inputs. See also Section 15.6 for a
discussion of other speedup techniques.


Algorithm 15.1: GP regression 

1 L = cholesky(K + σ_y² I);
2 α = Lᵀ \ (L \ y);
3 E[f_*] = k_*ᵀ α;
4 v = L \ k_*;
5 var[f_*] = κ(x_*, x_*) − vᵀ v;
6 log p(y|X) = −½ yᵀ α − Σ_i log L_ii − (N/2) log(2π)
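
The pseudo-code above translates almost line by line into NumPy. The following is a minimal sketch for a single test point, assuming the kernel quantities have already been evaluated; the function name gp_regression and its argument names are illustrative. It uses the general solver np.linalg.solve for brevity, whereas a dedicated triangular solver would exploit the structure of L.

```python
import numpy as np

def gp_regression(K, k_star, k_ss, y, sigma_y):
    """GP regression for one test point, following Algorithm 15.1.

    K: (N, N) training kernel matrix; k_star: (N,) kernel between training
    inputs and the test input; k_ss: scalar kappa(x_*, x_*); y: (N,) targets;
    sigma_y: noise standard deviation.
    """
    N = len(y)
    L = np.linalg.cholesky(K + sigma_y**2 * np.eye(N))      # line 1: K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # line 2: alpha = K_y^{-1} y
    mean = k_star @ alpha                                   # line 3
    v = np.linalg.solve(L, k_star)                          # line 4
    var = k_ss - v @ v                                      # line 5
    log_ml = (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
              - 0.5 * N * np.log(2 * np.pi))                # line 6
    return mean, var, log_ml
```

Note that the explicit inverse of $\mathbf{K}_y$ never appears: every use of it is replaced by two triangular solves against the Cholesky factor.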


Semi-parametric GPs * 
Sometimes it is useful to use a linear model for the mean of the process, as follows: 
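$$f(\mathbf{x}) = \boldsymbol{\beta}^T \boldsymbol{\phi}(\mathbf{x}) + r(\mathbf{x}) \tag{15.26}$$

where $r(\mathbf{x}) \sim \mathrm{GP}(0, \kappa(\mathbf{x}, \mathbf{x}'))$ models the residuals. This combines a parametric and a non-parametric
model, and is known as a semi-parametric model.

If we assume $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{b}, \mathbf{B})$, we can integrate these parameters out to get a new GP (O'Hagan
1978):

$$f(\mathbf{x}) \sim \mathrm{GP}\left(\boldsymbol{\phi}(\mathbf{x})^T \mathbf{b},\; \kappa(\mathbf{x}, \mathbf{x}') + \boldsymbol{\phi}(\mathbf{x})^T \mathbf{B} \boldsymbol{\phi}(\mathbf{x}')\right) \tag{15.27}$$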
$\log p(y_i|f_i)$ | $\frac{\partial}{\partial f_i} \log p(y_i|f_i)$ | $\frac{\partial^2}{\partial f_i^2} \log p(y_i|f_i)$
$\log \mathrm{sigm}(y_i f_i)$ | $t_i - \pi_i$ | $-\pi_i(1 - \pi_i)$
$\log \Phi(y_i f_i)$ | $\frac{y_i \phi(f_i)}{\Phi(y_i f_i)}$ | $-\frac{\phi(f_i)^2}{\Phi(y_i f_i)^2} - \frac{y_i f_i \phi(f_i)}{\Phi(y_i f_i)}$


Table 15.1 Likelihood, gradient and Hessian for binary logistic/probit GP regression. We assume $y_i \in
\{-1, +1\}$ and define $t_i = (y_i + 1)/2 \in \{0, 1\}$ and $\pi_i = \mathrm{sigm}(f_i)$ for logistic regression, and $\pi_i = \Phi(f_i)$
for probit regression. Also, $\phi$ and $\Phi$ are the pdf and cdf of $\mathcal{N}(0, 1)$. From (Rasmussen and Williams 2006,
p43).


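Integrating out $\boldsymbol{\beta}$, the corresponding predictive distribution for test inputs $\mathbf{X}_*$ has the following
form (Rasmussen and Williams 2006, p28):

$$p(\mathbf{f}_*|\mathbf{X}_*, \mathbf{X}, \mathbf{y}) = \mathcal{N}(\bar{\mathbf{f}}_*, \mathrm{cov}[\mathbf{f}_*]) \tag{15.28}$$

$$\bar{\mathbf{f}}_* = \boldsymbol{\Phi}_*^T \bar{\boldsymbol{\beta}} + \mathbf{K}_*^T \mathbf{K}_y^{-1}(\mathbf{y} - \boldsymbol{\Phi}^T \bar{\boldsymbol{\beta}}) \tag{15.29}$$

$$\bar{\boldsymbol{\beta}} = (\boldsymbol{\Phi} \mathbf{K}_y^{-1} \boldsymbol{\Phi}^T + \mathbf{B}^{-1})^{-1}(\boldsymbol{\Phi} \mathbf{K}_y^{-1} \mathbf{y} + \mathbf{B}^{-1}\mathbf{b}) \tag{15.30}$$

$$\mathrm{cov}[\mathbf{f}_*] = \mathbf{K}_{**} - \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{K}_* + \mathbf{R}^T (\mathbf{B}^{-1} + \boldsymbol{\Phi} \mathbf{K}_y^{-1} \boldsymbol{\Phi}^T)^{-1} \mathbf{R} \tag{15.31}$$

$$\mathbf{R} = \boldsymbol{\Phi}_* - \boldsymbol{\Phi} \mathbf{K}_y^{-1} \mathbf{K}_* \tag{15.32}$$

Here $\boldsymbol{\Phi}$ is the $D \times N$ matrix whose columns are the feature vectors $\boldsymbol{\phi}(\mathbf{x}_i)$ of the training inputs, and
$\boldsymbol{\Phi}_*$ is the corresponding matrix for the test inputs, following the convention of Rasmussen and Williams.
The predictive mean is the output of the linear model plus a correction term due to the GP, and
the predictive covariance is the usual GP covariance plus an extra term due to the uncertainty
in $\boldsymbol{\beta}$.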


GPs meet GLMs 


In this section, we extend GPs to the GLM setting, focussing on the classification case. As with 
Bayesian logistic regression, the main difficulty is that the Gaussian prior is not conjugate to 
the Bernoulli/multinoulli likelihood. There are several approximations one can adopt: Gaussian
approximation (Section 8.4.3), expectation propagation (Kuss and Rasmussen 2005; Nickisch and 
Rasmussen 2008), variational (Girolami and Rogers 2006; Opper and Archambeau 2009), MCMC 
(Neal 1997; Christensen et al. 2006), etc. Here we focus on the Gaussian approximation, since it 
is the simplest and fastest. 


Binary classification 


In the binary case, we define the model as $p(y_i|\mathbf{x}_i) = \sigma(y_i f(\mathbf{x}_i))$, where, following (Rasmussen
and Williams 2006), we assume $y_i \in \{-1, +1\}$, and we let $\sigma(z) = \mathrm{sigm}(z)$ (logistic regression)
or $\sigma(z) = \Phi(z)$ (probit regression). As for GP regression, we assume $f \sim \mathrm{GP}(0, \kappa)$.


Computing the posterior 


Define the log of the unnormalized posterior as follows: 


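$$\ell(\mathbf{f}) = \log p(\mathbf{y}|\mathbf{f}) + \log p(\mathbf{f}|\mathbf{X}) = \log p(\mathbf{y}|\mathbf{f}) - \frac{1}{2}\mathbf{f}^T \mathbf{K}^{-1} \mathbf{f} - \frac{1}{2}\log|\mathbf{K}| - \frac{N}{2}\log 2\pi \tag{15.33}$$

Let $J(\mathbf{f}) \triangleq -\ell(\mathbf{f})$ be the function we want to minimize. The gradient and Hessian of this are
given by

$$\mathbf{g} = -\nabla \log p(\mathbf{y}|\mathbf{f}) + \mathbf{K}^{-1}\mathbf{f} \tag{15.34}$$

$$\mathbf{H} = -\nabla\nabla \log p(\mathbf{y}|\mathbf{f}) + \mathbf{K}^{-1} = \mathbf{W} + \mathbf{K}^{-1} \tag{15.35}$$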


Note that $\mathbf{W} \triangleq -\nabla\nabla \log p(\mathbf{y}|\mathbf{f})$ is a diagonal matrix because the data are iid (conditional on
f). Expressions for the gradient and Hessian of the log likelihood for the logit and probit case 
are given in Sections 8.3.1 and 9.4.1, and summarized in Table 15.1. 

We can use IRLS to find the MAP estimate. The update has the form 


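$$\mathbf{f}^{new} = \mathbf{f} - \mathbf{H}^{-1}\mathbf{g} = \mathbf{f} + (\mathbf{K}^{-1} + \mathbf{W})^{-1}(\nabla \log p(\mathbf{y}|\mathbf{f}) - \mathbf{K}^{-1}\mathbf{f}) \tag{15.36}$$

$$= (\mathbf{K}^{-1} + \mathbf{W})^{-1}(\mathbf{W}\mathbf{f} + \nabla \log p(\mathbf{y}|\mathbf{f})) \tag{15.37}$$

At convergence, the Gaussian approximation of the posterior takes the following form:

$$p(\mathbf{f}|\mathbf{X}, \mathbf{y}) \approx \mathcal{N}(\hat{\mathbf{f}}, (\mathbf{K}^{-1} + \mathbf{W})^{-1}) \tag{15.38}$$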


Computing the posterior predictive 


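We now compute the posterior predictive. First we predict the latent function at the test case
$\mathbf{x}_*$. For the mean we have

$$\mathbb{E}[f_*|\mathbf{x}_*, \mathbf{X}, \mathbf{y}] = \int \mathbb{E}[f_*|\mathbf{f}, \mathbf{x}_*, \mathbf{X}, \mathbf{y}]\, p(\mathbf{f}|\mathbf{X}, \mathbf{y})\, d\mathbf{f} \tag{15.39}$$

$$= \int \mathbf{k}_*^T \mathbf{K}^{-1} \mathbf{f}\, p(\mathbf{f}|\mathbf{X}, \mathbf{y})\, d\mathbf{f} \tag{15.40}$$

$$= \mathbf{k}_*^T \mathbf{K}^{-1} \mathbb{E}[\mathbf{f}|\mathbf{X}, \mathbf{y}] \approx \mathbf{k}_*^T \mathbf{K}^{-1} \hat{\mathbf{f}} \tag{15.41}$$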


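where we used Equation 15.8 to get the mean of $f_*$ given noise-free $\mathbf{f}$.
To compute the predictive variance, we use the rule of iterated variance:

$$\mathrm{var}[f_*] = \mathbb{E}[\mathrm{var}[f_*|\mathbf{f}]] + \mathrm{var}[\mathbb{E}[f_*|\mathbf{f}]] \tag{15.42}$$

where all probabilities are conditioned on $\mathbf{x}_*, \mathbf{X}, \mathbf{y}$. From Equation 15.9 we have

$$\mathbb{E}[\mathrm{var}[f_*|\mathbf{f}]] = \mathbb{E}[k_{**} - \mathbf{k}_*^T \mathbf{K}^{-1} \mathbf{k}_*] = k_{**} - \mathbf{k}_*^T \mathbf{K}^{-1} \mathbf{k}_* \tag{15.43}$$

From Equation 15.8 we have

$$\mathrm{var}[\mathbb{E}[f_*|\mathbf{f}]] = \mathrm{var}[\mathbf{k}_*^T \mathbf{K}^{-1} \mathbf{f}] = \mathbf{k}_*^T \mathbf{K}^{-1} \mathrm{cov}[\mathbf{f}]\, \mathbf{K}^{-1} \mathbf{k}_* \tag{15.44}$$

Combining these we get

$$\mathrm{var}[f_*] = k_{**} - \mathbf{k}_*^T (\mathbf{K}^{-1} - \mathbf{K}^{-1} \mathrm{cov}[\mathbf{f}]\, \mathbf{K}^{-1}) \mathbf{k}_* \tag{15.45}$$

From Equation 15.38 we have $\mathrm{cov}[\mathbf{f}] \approx (\mathbf{K}^{-1} + \mathbf{W})^{-1}$. Using the matrix inversion lemma we
get

$$\mathrm{var}[f_*] \approx k_{**} - \mathbf{k}_*^T \mathbf{K}^{-1} \mathbf{k}_* + \mathbf{k}_*^T \mathbf{K}^{-1} (\mathbf{K}^{-1} + \mathbf{W})^{-1} \mathbf{K}^{-1} \mathbf{k}_* \tag{15.46}$$

$$= k_{**} - \mathbf{k}_*^T (\mathbf{K} + \mathbf{W}^{-1})^{-1} \mathbf{k}_* \tag{15.47}$$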
So in summary we have

$$p(f_*|\mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \mathcal{N}(\mathbb{E}[f_*], \mathrm{var}[f_*]) \tag{15.48}$$

To convert this into a predictive distribution for binary responses, we use

$$\pi_* = p(y_* = 1|\mathbf{x}_*, \mathbf{X}, \mathbf{y}) \approx \int \sigma(f_*)\, p(f_*|\mathbf{x}_*, \mathbf{X}, \mathbf{y})\, df_* \tag{15.49}$$

This can be approximated using any of the methods discussed in Section 8.4.4, where we
discussed Bayesian logistic regression. For example, using the probit approximation of Section 8.4.4.2,
we have $\pi_* \approx \mathrm{sigm}(\kappa(v)\, \mathbb{E}[f_*])$, where $v = \mathrm{var}[f_*]$ and $\kappa^2(v) = (1 + \pi v/8)^{-1}$.
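
To see how close the probit approximation is to the integral in Equation 15.49 for the logistic link, one can compare it against Gauss-Hermite quadrature. This is a small sketch under the assumption that the latent predictive is exactly Gaussian with the given mean and variance; the function names are illustrative.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def predictive_prob_probit_approx(mean, var):
    """pi_* ~= sigm(kappa(v) * E[f_*]) with kappa^2(v) = 1 / (1 + pi * v / 8)."""
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)
    return sigm(kappa * mean)

def predictive_prob_quadrature(mean, var, n=50):
    """Gauss-Hermite estimate of the integral of sigm(f) N(f | mean, var) df."""
    t, w = np.polynomial.hermite.hermgauss(n)       # nodes/weights for exp(-t^2)
    # Change of variables f = mean + sqrt(2 var) t turns the Gaussian into exp(-t^2).
    return np.sum(w * sigm(mean + np.sqrt(2.0 * var) * t)) / np.sqrt(np.pi)

p_approx = predictive_prob_probit_approx(1.0, 2.0)
p_quad = predictive_prob_quadrature(1.0, 2.0)
```

Both estimates are pulled toward 0.5 relative to the plug-in value sigm(E[f_*]), which is the moderating effect of accounting for uncertainty in the latent function.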


Computing the marginal likelihood 


We need the marginal likelihood in order to optimize the kernel parameters. Using the Laplace 
approximation in Equation 8.54 we have 


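$$\log p(\mathbf{y}|\mathbf{X}) \approx \ell(\hat{\mathbf{f}}) - \frac{1}{2}\log|\mathbf{H}| + \mathrm{const} \tag{15.50}$$

Hence

$$\log p(\mathbf{y}|\mathbf{X}) \approx \log p(\mathbf{y}|\hat{\mathbf{f}}) - \frac{1}{2}\hat{\mathbf{f}}^T \mathbf{K}^{-1} \hat{\mathbf{f}} - \frac{1}{2}\log|\mathbf{K}| - \frac{1}{2}\log|\mathbf{K}^{-1} + \mathbf{W}| \tag{15.51}$$

Computing the derivatives $\partial \log p(\mathbf{y}|\mathbf{X}, \boldsymbol{\theta}) / \partial \theta_j$ is more complex than in the regression case, since $\hat{\mathbf{f}}$
and $\mathbf{W}$, as well as $\mathbf{K}$, depend on $\boldsymbol{\theta}$. Details can be found in (Rasmussen and Williams 2006,
p125).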
Numerically stable computation * 


To implement the above equations in a numerically stable way, it is best to avoid inverting K 
or W. (Rasmussen and Williams 2006, p45) suggest defining 


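$$\mathbf{B} = \mathbf{I}_N + \mathbf{W}^{\frac{1}{2}} \mathbf{K} \mathbf{W}^{\frac{1}{2}} \tag{15.52}$$

which has eigenvalues bounded below by 1 (because of the $\mathbf{I}$) and above by $1 + \frac{N}{4}\max_{ij} K_{ij}$
(because $w_{ii} = \pi_i(1 - \pi_i) \leq 0.25$), and hence can be safely inverted.
One can use the matrix inversion lemma to show

$$(\mathbf{K}^{-1} + \mathbf{W})^{-1} = \mathbf{K} - \mathbf{K}\mathbf{W}^{\frac{1}{2}} \mathbf{B}^{-1} \mathbf{W}^{\frac{1}{2}} \mathbf{K} \tag{15.53}$$

Hence the IRLS update becomes

$$\mathbf{f}^{new} = (\mathbf{K}^{-1} + \mathbf{W})^{-1} \mathbf{b}, \quad \mathbf{b} \triangleq \mathbf{W}\mathbf{f} + \nabla \log p(\mathbf{y}|\mathbf{f}) \tag{15.54}$$

$$= \mathbf{K}(\mathbf{I} - \mathbf{W}^{\frac{1}{2}} \mathbf{B}^{-1} \mathbf{W}^{\frac{1}{2}} \mathbf{K})\mathbf{b} \tag{15.55}$$

$$= \mathbf{K}\mathbf{a}, \quad \mathbf{a} \triangleq \mathbf{b} - \mathbf{W}^{\frac{1}{2}} \mathbf{L}^T \backslash (\mathbf{L} \backslash (\mathbf{W}^{\frac{1}{2}} \mathbf{K}\mathbf{b})) \tag{15.56}$$

where $\mathbf{B} = \mathbf{L}\mathbf{L}^T$ is a Cholesky decomposition of $\mathbf{B}$. The fitting algorithm takes $O(TN^3)$
time and $O(N^2)$ space, where $T$ is the number of Newton iterations.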

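At convergence we have $\mathbf{a} = \mathbf{K}^{-1}\hat{\mathbf{f}}$, so we can evaluate the log marginal likelihood (Equation 15.51)
using

$$\log p(\mathbf{y}|\mathbf{X}) = \log p(\mathbf{y}|\hat{\mathbf{f}}) - \frac{1}{2}\mathbf{a}^T \hat{\mathbf{f}} - \sum_i \log L_{ii} \tag{15.57}$$

where we exploited the fact that

$$|\mathbf{B}| = |\mathbf{K}||\mathbf{K}^{-1} + \mathbf{W}| = |\mathbf{I}_N + \mathbf{W}^{\frac{1}{2}} \mathbf{K} \mathbf{W}^{\frac{1}{2}}| \tag{15.58}$$

We now compute the predictive distribution. Rather than using $\mathbb{E}[f_*] = \mathbf{k}_*^T \mathbf{K}^{-1} \hat{\mathbf{f}}$, we exploit
the fact that at the mode, $\nabla \ell = \mathbf{0}$, so $\hat{\mathbf{f}} = \mathbf{K}(\nabla \log p(\mathbf{y}|\hat{\mathbf{f}}))$. Hence we can rewrite the predictive
mean as follows:²

$$\mathbb{E}[f_*] = \mathbf{k}_*^T \nabla \log p(\mathbf{y}|\hat{\mathbf{f}}) \tag{15.59}$$

To compute the predictive variance, we exploit the fact that

$$(\mathbf{K} + \mathbf{W}^{-1})^{-1} = \mathbf{W}^{\frac{1}{2}} \mathbf{W}^{-\frac{1}{2}} (\mathbf{K} + \mathbf{W}^{-1})^{-1} \mathbf{W}^{-\frac{1}{2}} \mathbf{W}^{\frac{1}{2}} = \mathbf{W}^{\frac{1}{2}} \mathbf{B}^{-1} \mathbf{W}^{\frac{1}{2}} \tag{15.60}$$

to get

$$\mathrm{var}[f_*] = k_{**} - \mathbf{k}_*^T \mathbf{W}^{\frac{1}{2}} (\mathbf{L}\mathbf{L}^T)^{-1} \mathbf{W}^{\frac{1}{2}} \mathbf{k}_* = k_{**} - \mathbf{v}^T \mathbf{v} \tag{15.61}$$

where $\mathbf{v} = \mathbf{L} \backslash (\mathbf{W}^{\frac{1}{2}} \mathbf{k}_*)$. We can then compute $\pi_*$.

The whole algorithm is summarized in Algorithm 15.2, based on (Rasmussen and Williams 2006,
p46). Fitting takes $O(TN^3)$ time, and prediction takes $O(N^2 N_*)$ time, where $N_*$ is the number
of test cases.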


Example 


In Figure 15.7, we show a synthetic binary classification problem in 2d. We use an SE kernel. On 
the left, we show predictions using hyper-parameters set by hand; we use a short length scale, 
hence the very sharp turns in the decision boundary. On the right, we show the predictions 
using the learned hyper-parameters; the model favors a more parsimonious explanation of the 
data. 


Multi-class classification 


In this section, we consider a model of the form $p(y_i|\mathbf{x}_i) = \mathrm{Cat}(y_i|\mathcal{S}(\mathbf{f}_i))$, where $\mathbf{f}_i =
(f_{i1}, \ldots, f_{iC})$, and we assume $f_{\cdot c} \sim \mathrm{GP}(0, \kappa_c)$. Thus we have one latent function per class,
which are a priori independent, and which may use different kernels. As before, we will use 
a Gaussian approximation to the posterior. (A similar model, but using the multinomial probit 
function instead of the multinomial logit, is described in (Girolami and Rogers 2006).) 


2. We see that training points that are well-predicted by the model, for which $\nabla_i \log p(y_i|f_i) \approx 0$, do not contribute
strongly to the prediction at test points; this is similar to the behavior of support vectors in an SVM (see Section 14.5).
Algorithm 15.2: GP binary classification using Gaussian approximation

1 // First compute MAP estimate using IRLS;
2 f = 0;
3 repeat
4     W = −∇∇ log p(y|f);
5     B = I_N + W^{1/2} K W^{1/2};
6     L = cholesky(B);
7     b = W f + ∇ log p(y|f);
8     a = b − W^{1/2} Lᵀ \ (L \ (W^{1/2} K b));
9     f = K a;
10 until converged;
11 log p(y|X) = log p(y|f) − ½ aᵀ f − Σ_i log L_ii;
12 // Now perform prediction;
13 E[f_*] = k_*ᵀ ∇ log p(y|f);
14 v = L \ (W^{1/2} k_*);
15 var[f_*] = k_** − vᵀ v;
16 p(y_* = 1) = ∫ sigm(z) N(z|E[f_*], var[f_*]) dz;
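
Algorithm 15.2 can be sketched in NumPy as follows for the logistic likelihood. This is an illustrative implementation rather than the code behind gpcDemo2d: the function names, the convergence test, and the use of the probit approximation of Section 8.4.4.2 in place of the exact integral on line 16 are all choices made for this example.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def gpc_fit(K, y, max_iter=100, tol=1e-8):
    """IRLS for the posterior mode (lines 1-11), logistic likelihood, y in {-1, +1}."""
    N = len(y)
    t = (y + 1) / 2                                   # labels recoded to {0, 1}
    f = np.zeros(N)
    for _ in range(max_iter):
        pi = sigm(f)
        w = pi * (1 - pi)                             # diagonal of W (Table 15.1)
        sw = np.sqrt(w)
        B = np.eye(N) + sw[:, None] * K * sw[None, :]
        L = np.linalg.cholesky(B)
        b = w * f + (t - pi)                          # W f + grad log p(y|f)
        a = b - sw * np.linalg.solve(L.T, np.linalg.solve(L, sw * (K @ b)))
        f_new = K @ a
        done = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if done:
            break
    # L comes from the last iterate, which matches f up to tol at convergence.
    log_ml = np.sum(np.log(sigm(y * f))) - 0.5 * a @ f - np.log(np.diag(L)).sum()
    return f, log_ml

def gpc_predict(K, y, f_hat, k_star, k_ss):
    """Lines 13-16 for one test point; returns mean, variance and approximate pi_*."""
    t = (y + 1) / 2
    pi = sigm(f_hat)
    sw = np.sqrt(pi * (1 - pi))
    L = np.linalg.cholesky(np.eye(len(y)) + sw[:, None] * K * sw[None, :])
    mean = k_star @ (t - pi)                          # k_*^T grad log p(y|f_hat)
    v = np.linalg.solve(L, sw * k_star)
    var = k_ss - v @ v
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)    # probit approximation
    return mean, var, sigm(kappa * mean)
```

Neither function inverts K or W: all linear algebra goes through the Cholesky factor of the well-conditioned matrix B.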


[Figure 15.7 panel titles: (a) SE kernel, l=0.500, σ²=10.000; (b) SE kernel, l=1.280, σ²=14.455]


Figure 15.7 Contours of the posterior predictive probability for the red circle class generated by a GP with 
an SE kernel. Thick black line is the decision boundary if we threshold at a probability of 0.5. (a) Manual 
parameters, short length scale. (b) Learned parameters, long length scale. Figure generated by gpcDemo2d, 
based on code by Carl Rasmussen. 
Computing the posterior


The unnormalized log posterior is given by 


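$$\ell(\mathbf{f}) = -\frac{1}{2}\mathbf{f}^T \mathbf{K}^{-1} \mathbf{f} + \mathbf{y}^T \mathbf{f} - \sum_{i=1}^{N} \log\left(\sum_{c=1}^{C} \exp f_{ic}\right) - \frac{1}{2}\log|\mathbf{K}| - \frac{CN}{2}\log 2\pi \tag{15.62}$$

where

$$\mathbf{f} = (f_{11}, \ldots, f_{N1}, f_{12}, \ldots, f_{N2}, \ldots, f_{1C}, \ldots, f_{NC})^T \tag{15.63}$$

and $\mathbf{y}$ is a dummy encoding of the $y_i$’s which has the same layout as $\mathbf{f}$. Also, $\mathbf{K}$ is a block
diagonal matrix containing $\mathbf{K}_c$, where $\mathbf{K}_c = [\kappa_c(\mathbf{x}_i, \mathbf{x}_j)]$ models the correlation of the $c$’th
latent function.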
The gradient and Hessian are given by 


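$$\nabla \ell = -\mathbf{K}^{-1}\mathbf{f} + \mathbf{y} - \boldsymbol{\pi} \tag{15.64}$$

$$\nabla\nabla \ell = -\mathbf{K}^{-1} - \mathbf{W} \tag{15.65}$$

where $\mathbf{W} \triangleq \mathrm{diag}(\boldsymbol{\pi}) - \boldsymbol{\Pi}\boldsymbol{\Pi}^T$, where $\boldsymbol{\Pi}$ is a $CN \times N$ matrix obtained by stacking $\mathrm{diag}(\boldsymbol{\pi}_{:c})$
vertically. (Compare these expressions to standard logistic regression in Section 8.3.7.)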
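
The structure of W is easier to see when it is built explicitly for a tiny problem. The sketch below assumes the latent values are supplied as an N x C array and uses the class-major layout of Equation 15.63; the function names are illustrative, and forming the dense CN x CN matrix is only sensible for small N and C.

```python
import numpy as np

def softmax_rows(F):
    """Row-wise softmax of an N x C array, shifted for numerical stability."""
    E = np.exp(F - F.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def build_W(F):
    """Dense W = diag(pi) - Pi Pi^T for N x C latent values F."""
    P = softmax_rows(F)                          # P[i, c] = pi_ic
    N, C = P.shape
    pi = P.T.reshape(-1)                         # class-major layout, as in Eq. 15.63
    Pi = np.vstack([np.diag(P[:, c]) for c in range(C)])   # CN x N
    return np.diag(pi) - Pi @ Pi.T

F = np.array([[0.2, -1.0, 0.5],
              [1.5, 0.3, -0.7]])                 # N = 2 points, C = 3 classes
W = build_W(F)
```

Unlike the binary case, W is not diagonal: entries for the same data point but different classes are coupled through the softmax, while entries for different data points remain zero.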
We can use IRLS to compute the mode. The Newton step has the form 


$$\mathbf{f}^{new} = (\mathbf{K}^{-1} + \mathbf{W})^{-1}(\mathbf{W}\mathbf{f} + \mathbf{y} - \boldsymbol{\pi}) \tag{15.66}$$

Naively implementing this would take $O(C^3 N^3)$ time. However, we can reduce this to $O(CN^3)$,
as shown in (Rasmussen and Williams 2006, p52).

Computing the posterior predictive 


We can compute the posterior predictive in a manner analogous to Section 15.3.1.2. For the 
mean of the latent response we have 


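$$\mathbb{E}[f_{*c}] = \mathbf{k}_c(\mathbf{x}_*)^T \mathbf{K}_c^{-1} \hat{\mathbf{f}}_c = \mathbf{k}_c(\mathbf{x}_*)^T (\mathbf{y}_c - \hat{\boldsymbol{\pi}}_c) \tag{15.67}$$

We can put this in vector form by writing

$$\mathbb{E}[\mathbf{f}_*] = \mathbf{Q}_*^T (\mathbf{y} - \hat{\boldsymbol{\pi}}) \tag{15.68}$$

where

$$\mathbf{Q}_* = \begin{pmatrix} \mathbf{k}_1(\mathbf{x}_*) & \cdots & \mathbf{0} \\ \vdots & \ddots & \vdots \\ \mathbf{0} & \cdots & \mathbf{k}_C(\mathbf{x}_*) \end{pmatrix} \tag{15.69}$$

Using a similar argument to Equation 15.47, we can show that the covariance of the latent
response is given by

$$\mathrm{cov}[\mathbf{f}_*] = \boldsymbol{\Sigma} + \mathbf{Q}_*^T \mathbf{K}^{-1} (\mathbf{K}^{-1} + \mathbf{W})^{-1} \mathbf{K}^{-1} \mathbf{Q}_* \tag{15.70}$$

$$= \mathrm{diag}(\mathbf{k}(\mathbf{x}_*, \mathbf{x}_*)) - \mathbf{Q}_*^T (\mathbf{K} + \mathbf{W}^{-1})^{-1} \mathbf{Q}_* \tag{15.71}$$

where $\boldsymbol{\Sigma}$ is a $C \times C$ diagonal matrix with $\Sigma_{cc} = \kappa_c(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_c^T(\mathbf{x}_*) \mathbf{K}_c^{-1} \mathbf{k}_c(\mathbf{x}_*)$, and
$\mathbf{k}(\mathbf{x}_*, \mathbf{x}_*) = [\kappa_c(\mathbf{x}_*, \mathbf{x}_*)]$.
To compute the posterior predictive for the visible response, we need to use

$$p(y|\mathbf{x}_*, \mathbf{X}, \mathbf{y}) \approx \int \mathrm{Cat}(y|\mathcal{S}(\mathbf{f}_*))\, \mathcal{N}(\mathbf{f}_*|\mathbb{E}[\mathbf{f}_*], \mathrm{cov}[\mathbf{f}_*])\, d\mathbf{f}_* \tag{15.72}$$

We can use any of the deterministic approximations to the softmax function discussed in Section 21.8.1.1
to compute this. Alternatively, we can just use Monte Carlo.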


Computing the marginal likelihood 


Using arguments similar to the binary case, we can show that 


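$$\log p(\mathbf{y}|\mathbf{X}) \approx -\frac{1}{2}\hat{\mathbf{f}}^T \mathbf{K}^{-1} \hat{\mathbf{f}} + \mathbf{y}^T \hat{\mathbf{f}} - \sum_{i=1}^{N} \log\left(\sum_{c=1}^{C} \exp \hat{f}_{ic}\right) - \frac{1}{2}\log|\mathbf{I}_{CN} + \mathbf{W}^{\frac{1}{2}} \mathbf{K} \mathbf{W}^{\frac{1}{2}}| \tag{15.73}$$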


This can be optimized numerically in the usual way. 


Numerical and computational issues 


One can implement model fitting in $O(TCN^3)$ time and $O(CN^2)$ space, where $T$ is the
number of Newton iterations, using the techniques described in (Rasmussen and Williams 2006,
p50). Prediction takes $O(CN^3 + CN^2 N_*)$ time, where $N_*$ is the number of test cases.


GPs for Poisson regression 


In this section, we illustrate GPs for Poisson regression. An interesting application of this is to 
spatial disease mapping. For example, (Vanhatalo et al. 2010) discuss the problem of modeling 
the relative risk of heart attack in different regions in Finland. The data consists of the heart 
attacks in Finland from 1996-2000 aggregated into 20km x 20km lattice cells. The model has 
the following form:

$$y_i \sim \mathrm{Poi}(e_i r_i) \tag{15.74}$$

where $e_i$ is the known expected number of deaths (related to the population of cell $i$ and the
overall death rate), and $r_i$ is the relative risk of cell $i$ which we want to infer. Since the
data counts are small, we regularize the problem by sharing information with spatial neighbors.
Hence we assume $\mathbf{f} \triangleq \log(\mathbf{r}) \sim \mathrm{GP}(0, \kappa)$, where we use a Matern kernel with $\nu = 3/2$, and a
length scale and magnitude that are estimated from data.

Figure 15.8 gives an example of the kind of output one can obtain from this method, based 
on data from 911 locations. On the left we plot the posterior mean relative risk (RR), and on the 
right, the posterior variance. We see that the RR is higher in Eastern Finland, which is consistent 
with other studies. We also see that the variance in the North is higher, since there are fewer 
people living there. 
[Figure 15.8 panel titles: Posterior mean of the relative risk, FIC (left); Posterior variance of the relative risk, FIC (right)]

Figure 15.8 We show the relative risk of heart disease in Finland using a Poisson GP. Left: posterior mean. 
Right: posterior variance. Figure generated by gpSpatialDemoLaplace, written by Jarno Vanhatalo. 


Connection with other methods 


There are a variety of other methods in statistics and machine learning that are closely related to
GP regression/ classification. We give a brief review of some of these below. 


Linear models compared to GPs 


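Consider Bayesian linear regression for $D$-dimensional features, where the prior on the weights
is $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$. The posterior predictive distribution is given by the following:

$$p(f_*|\mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \mathcal{N}(\mu, \sigma^2) \tag{15.75}$$

$$\mu = \frac{1}{\sigma_y^2} \mathbf{x}_*^T \mathbf{A}^{-1} \mathbf{X}^T \mathbf{y} \tag{15.76}$$

$$\sigma^2 = \mathbf{x}_*^T \mathbf{A}^{-1} \mathbf{x}_* \tag{15.77}$$

where $\mathbf{A} = \sigma_y^{-2} \mathbf{X}^T \mathbf{X} + \boldsymbol{\Sigma}^{-1}$. One can show that we can rewrite the above distribution as
follows

$$\mu = \mathbf{x}_*^T \boldsymbol{\Sigma} \mathbf{X}^T (\mathbf{K} + \sigma_y^2 \mathbf{I})^{-1} \mathbf{y} \tag{15.78}$$

$$\sigma^2 = \mathbf{x}_*^T \boldsymbol{\Sigma} \mathbf{x}_* - \mathbf{x}_*^T \boldsymbol{\Sigma} \mathbf{X}^T (\mathbf{K} + \sigma_y^2 \mathbf{I})^{-1} \mathbf{X} \boldsymbol{\Sigma} \mathbf{x}_* \tag{15.79}$$

where we have defined $\mathbf{K} = \mathbf{X} \boldsymbol{\Sigma} \mathbf{X}^T$, which is of size $N \times N$. Since the features only ever
appear in the form $\mathbf{X} \boldsymbol{\Sigma} \mathbf{X}^T$, $\mathbf{x}_*^T \boldsymbol{\Sigma} \mathbf{X}^T$ or $\mathbf{x}_*^T \boldsymbol{\Sigma} \mathbf{x}_*$, we can kernelize the above expression by
defining $\kappa(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \boldsymbol{\Sigma} \mathbf{x}'$.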
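
The equivalence of the weight-space and function-space forms can be checked numerically. The sketch below uses synthetic data; the sizes, the prior covariance and the random seed are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma_y = 20, 3, 0.5
X = rng.standard_normal((N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + sigma_y * rng.standard_normal(N)
Sigma = np.diag([1.0, 2.0, 0.5])               # prior covariance of the weights
x_star = rng.standard_normal(D)

# Weight-space view (Equations 15.76 and 15.77): a D x D system.
A = X.T @ X / sigma_y**2 + np.linalg.inv(Sigma)
mu_w = x_star @ np.linalg.solve(A, X.T @ y) / sigma_y**2
var_w = x_star @ np.linalg.solve(A, x_star)

# Function-space view (Equations 15.78 and 15.79): an N x N system with K = X Sigma X^T.
K = X @ Sigma @ X.T
Ky = K + sigma_y**2 * np.eye(N)
k_star = X @ Sigma @ x_star
mu_f = k_star @ np.linalg.solve(Ky, y)
var_f = x_star @ Sigma @ x_star - k_star @ np.linalg.solve(Ky, k_star)

rank_K = np.linalg.matrix_rank(K)              # at most D for this linear kernel
```

The two views give the same mean and variance up to rounding error, but the first solves a D x D system and the second an N x N one, so which is cheaper depends on whether D or N is larger.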

Thus we see that Bayesian linear regression is equivalent to a GP with covariance function 
k(x, x’) = xT Dx’. Note, however, that this is a degenerate covariance function, since it has at 
most D non-zero eigenvalues. Intuitively this reflects the fact that the model can only represent 
a limited number of functions. This can result in underfitting, since the model is not flexible 
enough to capture the data. What is perhaps worse, it can result in overconfidence, since the 
model’s prior is so impoverished that its posterior will become too concentrated. So not only is
the model wrong, it thinks it’s right!


Linear smoothers compared to GPs 


A linear smoother is a regression function which is a linear function of the training outputs: 


$$\hat{f}(\mathbf{x}_*) = \sum_i w_i(\mathbf{x}_*)\, y_i \tag{15.80}$$


where w;(x,) is called the weight function (Silverman 1984). (Do not confuse this with a linear 
model, where the output is a linear function of the input vector.) 

There are a variety of linear smoothers, such as kernel regression (Section 14.7.4), locally 
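weighted regression (Section 14.7.5), smoothing splines (Section 15.4.6), and GP regression. To see
that GP regression is a linear smoother, note that the mean of the posterior predictive distribution
of a GP is given by

$$\bar{f}(\mathbf{x}_*) = \mathbf{k}_*^T (\mathbf{K} + \sigma_y^2 \mathbf{I}_N)^{-1} \mathbf{y} = \sum_{i=1}^{N} y_i w_i(\mathbf{x}_*) \tag{15.81}$$

where $w_i(\mathbf{x}_*) = [(\mathbf{K} + \sigma_y^2 \mathbf{I}_N)^{-1} \mathbf{k}_*]_i$.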

In kernel regression, we derive the weight function from a smoothing kernel rather than a 
Mercer kernel, so it is clear that the weight function will then have local support. In the case 
of a GP, things are not as clear, since the weight function depends on the inverse of K. For 
certain GP kernel functions, we can analytically derive the form of $w_i(\mathbf{x})$; this is known as the
equivalent kernel (Silverman 1984). One can show that $\sum_{i=1}^{N} w_i(\mathbf{x}_*) = 1$, although we may
have $w_i(\mathbf{x}_*) < 0$, so we are computing a linear combination but not a convex combination of
the $y_i$’s. More interestingly, $w_i(\mathbf{x}_*)$ is a local function, even if the original kernel used by the GP
is not local. Furthermore the effective bandwidth of the equivalent kernel of a GP automatically
decreases as the sample size $N$ increases, whereas in kernel smoothing, the bandwidth $h$ needs
to be set by hand to adapt to $N$. See e.g., (Rasmussen and Williams 2006, Sec 2.6, Sec 7.1) for
details. 


Degrees of freedom of linear smoothers 


It is clear why this method is called “linear”, but why is it called a “smoother”? This is best 
explained in terms of GPs. Consider the prediction on the training set: 


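$$\bar{\mathbf{f}} = \mathbf{K}(\mathbf{K} + \sigma_y^2 \mathbf{I})^{-1} \mathbf{y} \tag{15.82}$$

Now let $\mathbf{K}$ have the eigendecomposition $\mathbf{K} = \sum_{i=1}^{N} \lambda_i \mathbf{u}_i \mathbf{u}_i^T$. Since $\mathbf{K}$ is real and symmetric
positive definite, the eigenvalues $\lambda_i$ are real and non-negative and the eigenvectors $\mathbf{u}_i$ are
orthonormal. Now let $\mathbf{y} = \sum_{i=1}^{N} \gamma_i \mathbf{u}_i$, where $\gamma_i = \mathbf{u}_i^T \mathbf{y}$. Then we can rewrite the above
equation as follows:

$$\bar{\mathbf{f}} = \sum_{i=1}^{N} \frac{\gamma_i \lambda_i}{\lambda_i + \sigma_y^2} \mathbf{u}_i \tag{15.83}$$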
This is the same as Equation 7.47, except we are working with the eigenvectors of the Gram
matrix $\mathbf{K}$ instead of the data matrix $\mathbf{X}$. In any case, the interpretation is similar: if $\frac{\lambda_i}{\lambda_i + \sigma_y^2} \ll 1$,
then the corresponding basis function $\mathbf{u}_i$ will not have much influence. Consequently the high-frequency
components in $\mathbf{y}$ are smoothed out. The effective degrees of freedom of the linear
smoother is defined as

$$\mathrm{dof} \triangleq \mathrm{tr}(\mathbf{K}(\mathbf{K} + \sigma_y^2 \mathbf{I})^{-1}) = \sum_{i=1}^{N} \frac{\lambda_i}{\lambda_i + \sigma_y^2} \tag{15.84}$$


This specifies how “wiggly” the curve is. 
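
As a small numerical illustration of Equation 15.84, the sketch below computes the effective degrees of freedom of a GP smoother with an SE kernel on a 1D grid, for a short and a long length scale. The grid, length scales and noise level are arbitrary values chosen for the example.

```python
import numpy as np

def smoother_dof(K, sigma_y):
    """Effective degrees of freedom tr(K (K + sigma_y^2 I)^{-1}) via eigenvalues of K."""
    lam = np.linalg.eigvalsh(K)                 # real eigenvalues of the symmetric K
    return np.sum(lam / (lam + sigma_y**2))

x = np.linspace(0, 10, 50)
sqdist = (x[:, None] - x[None, :]) ** 2
K_short = np.exp(-0.5 * sqdist / 0.5**2)        # SE kernel, ell = 0.5
K_long = np.exp(-0.5 * sqdist / 5.0**2)         # SE kernel, ell = 5

dof_short = smoother_dof(K_short, sigma_y=0.1)
dof_long = smoother_dof(K_long, sigma_y=0.1)
```

The shorter length scale gives the larger value, matching the intuition that a wigglier fit uses more degrees of freedom; in both cases the value lies between 0 and N.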


SVMs compared to GPs 
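We saw in Section 14.5.2 that the SVM objective for binary classification is given by Equation 14.57

$$J(\mathbf{w}) = \frac{1}{2}||\mathbf{w}||^2 + C \sum_{i=1}^{N} (1 - y_i f_i)_+ \tag{15.85}$$

We also know from Equation 14.59 that the optimal solution has the form $\mathbf{w} = \sum_i \alpha_i \mathbf{x}_i$,
so $||\mathbf{w}||^2 = \sum_{i,j} \alpha_i \alpha_j \mathbf{x}_i^T \mathbf{x}_j$. Kernelizing we get $||\mathbf{w}||^2 = \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}$. From Equation 14.61, and
absorbing the $\hat{w}_0$ term into one of the kernels, we have $\mathbf{f} = \mathbf{K}\boldsymbol{\alpha}$, so $||\mathbf{w}||^2 = \mathbf{f}^T \mathbf{K}^{-1} \mathbf{f}$. Hence
the SVM objective can be rewritten as

$$J(\mathbf{f}) = \frac{1}{2}\mathbf{f}^T \mathbf{K}^{-1} \mathbf{f} + C \sum_{i=1}^{N} (1 - y_i f_i)_+ \tag{15.86}$$

Compare this to MAP estimation for a GP classifier:

$$J(\mathbf{f}) = \frac{1}{2}\mathbf{f}^T \mathbf{K}^{-1} \mathbf{f} - \sum_{i=1}^{N} \log p(y_i|f_i) \tag{15.87}$$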


It is tempting to think that we can “convert” an SVM into a GP by figuring out what likelihood 
would be equivalent to the hinge loss. However, it turns out there is no such likelihood (Sollich 
2002), although there is a pseudo-likelihood that matches the SVM (see Section 14.5.5). 

From Figure 6.7 we saw that the hinge loss and the logistic loss (as well as the probit loss) 
are quite similar to each other. The main difference is that the hinge loss is strictly 0 for errors 
larger than 1. This gives rise to a sparse solution. In Section 14.3.2, we discussed other ways 
to derive sparse kernel machines. We discuss the connection between these methods and GPs 
below. 


L1VM and RVMs compared to GPs


Sparse kernel machines are just linear models with basis function expansion of the form $\boldsymbol{\phi}(\mathbf{x}) =
[\kappa(\mathbf{x}, \mathbf{x}_1), \ldots, \kappa(\mathbf{x}, \mathbf{x}_N)]$. From Section 15.4.1, we know that this is equivalent to a GP with the
following kernel:

$$\kappa(\mathbf{x}, \mathbf{x}') = \sum_{j=1}^{N} \frac{1}{\alpha_j} \phi_j(\mathbf{x}) \phi_j(\mathbf{x}') \tag{15.88}$$

where $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \mathrm{diag}(\alpha_j^{-1}))$. This kernel function has two interesting properties. First, it
is degenerate, meaning it has at most $N$ non-zero eigenvalues, so the joint distribution $p(\mathbf{f}, \mathbf{f}_*)$
will be highly constrained. Second, the kernel depends on the training data. This can cause the 
model to be overconfident when extrapolating beyond the training data. To see this, consider 
a point x, far outside the convex hull of the data. All the basis functions will have values 
close to 0, so the prediction will back off to the mean of the GP. More worryingly, the variance 
will back off to the noise variance. By contrast, when using a non-degenerate kernel function, 
the predictive variance increases as we move away from the training data, as desired. See 
(Rasmussen and Quiñonero-Candela 2005) for further discussion.


Neural networks compared to GPs 


In Section 16.5, we will discuss neural networks, which are a nonlinear generalization of GLMs. 
In the binary classification case, a neural network is defined by a logistic regression model 
applied to a logistic regression model: 


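$$p(y|\mathbf{x}, \boldsymbol{\theta}) = \mathrm{Ber}\left(y|\mathrm{sigm}\left(\mathbf{w}^T \mathrm{sigm}(\mathbf{V}\mathbf{x})\right)\right) \tag{15.89}$$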


It turns out there is an interesting connection between neural networks and Gaussian processes, 
as first pointed out by (Neal 1996). 

To explain the connection, we follow the presentation of (Rasmussen and Williams 2006, p91). 
Consider a neural network for regression with one hidden layer. This has the form 


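$$p(y|\mathbf{x}, \boldsymbol{\theta}) = \mathcal{N}(y|f(\mathbf{x}; \boldsymbol{\theta}), \sigma^2) \tag{15.90}$$

where

$$f(\mathbf{x}) = b + \sum_{j=1}^{H} v_j\, g(\mathbf{x}; \mathbf{u}_j) \tag{15.91}$$

where $b$ is the offset or bias term, $v_j$ is the output weight from hidden unit $j$ to the response
$y$, $\mathbf{u}_j$ are the input weights to unit $j$ from the input $\mathbf{x}$, and $g()$ is the hidden unit activation
function. This is typically the sigmoid or tanh function, but can be any smooth function.

Let us use the following priors on the weights: $b \sim \mathcal{N}(0, \sigma_b^2)$, $\mathbf{v} \sim \prod_j \mathcal{N}(v_j|0, \sigma_v^2)$,
$\mathbf{u} \sim \prod_j p(\mathbf{u}_j)$ for some unspecified $p(\mathbf{u}_j)$. Denoting all the weights by $\boldsymbol{\theta}$ we have

$$\mathbb{E}_{\boldsymbol{\theta}}[f(\mathbf{x})] = 0 \tag{15.92}$$

$$\mathbb{E}_{\boldsymbol{\theta}}[f(\mathbf{x}) f(\mathbf{x}')] = \sigma_b^2 + \sum_j \sigma_v^2\, \mathbb{E}_{\mathbf{u}}[g(\mathbf{x}; \mathbf{u}_j)\, g(\mathbf{x}'; \mathbf{u}_j)] \tag{15.93}$$

$$= \sigma_b^2 + H \sigma_v^2\, \mathbb{E}_{\mathbf{u}}[g(\mathbf{x}; \mathbf{u})\, g(\mathbf{x}'; \mathbf{u})] \tag{15.94}$$

where the last equality follows since the $H$ hidden units are iid. If we let $\sigma_v^2$ scale as $\omega^2/H$
(since more hidden units will increase the input to the final node, so we should scale down
the magnitude of the weights), then the last term becomes $\omega^2\, \mathbb{E}_{\mathbf{u}}[g(\mathbf{x}; \mathbf{u})\, g(\mathbf{x}'; \mathbf{u})]$. This is a
sum over $H$ iid random variables. Assuming that $g$ is bounded, we can apply the central limit
theorem. The result is that as $H \to \infty$, we get a Gaussian process.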
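Figure 15.9 (a) Covariance function $\kappa_{NN}(x, x')$ for $\sigma_0 = 10$, $\sigma = 10$. (b) Samples from a GP with
this kernel, using various values of $\sigma$. Figure generated by gpnnDemo, written by Chris Williams.


If we use as activation / transfer function $g(\mathbf{x}; \mathbf{u}) = \mathrm{erf}(u_0 + \sum_{j=1}^{D} u_j x_j)$, where $\mathrm{erf}(z) =
\frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2} dt$, and we choose $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$, then (Williams 1998) showed that the covariance
kernel has the form

$$\kappa_{NN}(\mathbf{x}, \mathbf{x}') = \frac{2}{\pi} \sin^{-1}\left(\frac{2 \tilde{\mathbf{x}}^T \boldsymbol{\Sigma} \tilde{\mathbf{x}}'}{\sqrt{(1 + 2\tilde{\mathbf{x}}^T \boldsymbol{\Sigma} \tilde{\mathbf{x}})(1 + 2(\tilde{\mathbf{x}}')^T \boldsymbol{\Sigma} \tilde{\mathbf{x}}')}}\right) \tag{15.95}$$

where $\tilde{\mathbf{x}} = (1, x_1, \ldots, x_D)$. This is a true “neural network” kernel, unlike the “sigmoid” kernel
$\kappa(\mathbf{x}, \mathbf{x}') = \tanh(a + b\mathbf{x}^T \mathbf{x}')$, which is not positive definite.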
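
Equation 15.95 is straightforward to evaluate. The following NumPy sketch computes the kernel matrix for 1D inputs with $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_0^2, \sigma^2)$ and checks numerically that it is positive semi-definite; the function name nn_kernel and the input grid are assumptions for the example, and this is not the gpnnDemo code.

```python
import numpy as np

def nn_kernel(X1, X2, Sigma):
    """Neural network kernel of Eq. 15.95; rows of X1, X2 are inputs, Sigma is (D+1)x(D+1)."""
    A1 = np.hstack([np.ones((len(X1), 1)), X1])     # augmented inputs (1, x_1, ..., x_D)
    A2 = np.hstack([np.ones((len(X2), 1)), X2])
    cross = 2.0 * A1 @ Sigma @ A2.T
    n1 = 1.0 + 2.0 * np.einsum("ij,jk,ik->i", A1, Sigma, A1)
    n2 = 1.0 + 2.0 * np.einsum("ij,jk,ik->i", A2, Sigma, A2)
    return (2.0 / np.pi) * np.arcsin(cross / np.sqrt(np.outer(n1, n2)))

X = np.linspace(-4, 4, 30)[:, None]                 # 1D inputs as a column
Sigma = np.diag([10.0**2, 10.0**2])                 # sigma_0 = 10, sigma = 10
K_nn = nn_kernel(X, X, Sigma)
min_eig = np.linalg.eigvalsh(K_nn).min()
```

The "+1" terms in the denominator keep the argument of the arcsine strictly inside (-1, 1), so the diagonal entries are always below 1; the smallest eigenvalue is non-negative up to rounding error.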

Figure 15.9(a) illustrates this kernel when $D = 2$ and $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_0^2, \sigma^2)$. Figure 15.9(b) shows
some functions sampled from the corresponding GP. These are equivalent to functions which
are superpositions of $\mathrm{erf}(u_0 + ux)$ where $u_0$ and $u$ are random. As $\sigma^2$ increases, the variance
of $u$ increases, so the function varies more quickly. Unlike the RBF kernel, functions sampled
from this kernel do not tend to 0 away from the data, but rather they tend to remain at the
same value they had at the “edge” of the data.

Now suppose we use an RBF network, which is equivalent to a hidden unit activation function
of the form $g(\mathbf{x}; \mathbf{u}) = \exp(-|\mathbf{x} - \mathbf{u}|^2/(2\sigma_g^2))$. If $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \sigma_u^2 \mathbf{I})$, one can show that the
corresponding kernel is equivalent to the RBF or SE kernel.


Smoothing splines compared to GPs * 


Smoothing splines are a widely used non-parametric method for smoothly interpolating data 
(Green and Silverman 1994). They are a special case of GPs, as we will see. They are usually
used when the input is 1 or 2 dimensional. 


Univariate splines 


The basic idea is to fit a function f by minimizing the discrepancy to the data plus a smoothing 
term that penalizes functions that are “too wiggly”. If we penalize the $m$’th derivative of the
function, the objective becomes

$$J(f) = \sum_{i=1}^{N} (f(x_i) - y_i)^2 + \lambda \int \left(\frac{d^m}{dx^m} f(x)\right)^2 dx \tag{15.96}$$


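One can show (Green and Silverman 1994) that the solution is a piecewise polynomial where
the polynomials have order $2m - 1$ in the interior bins $[x_{i-1}, x_i]$ (denoted $\mathcal{I}$), and order $m - 1$
in the two outermost intervals $(-\infty, x_1]$ and $[x_N, \infty)$:

$$f(x) = \sum_{j=0}^{m-1} \beta_j x^j + \mathbb{I}(x \in \mathcal{I}) \left(\sum_{i=1}^{N} \alpha_i (x - x_i)_+^{2m-1}\right) + \mathbb{I}(x \notin \mathcal{I}) \left(\sum_{i=1}^{N} \alpha_i (x - x_i)_+^{m-1}\right) \tag{15.97}$$

For example, if $m = 2$, we get the (natural) cubic spline

$$f(x) = \beta_0 + \beta_1 x + \mathbb{I}(x \in \mathcal{I}) \left(\sum_{i=1}^{N} \alpha_i (x - x_i)_+^3\right) + \mathbb{I}(x \notin \mathcal{I}) \left(\sum_{i=1}^{N} \alpha_i (x - x_i)_+\right) \tag{15.98}$$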


which is a series of truncated cubic polynomials, whose left hand sides are located at each of the 
N training points. (The fact that the model is linear on the edges prevents it from extrapolating 
too wildly beyond the range of the data; if we drop this requirement, we get an “unrestricted” 
spline.) 

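We can clearly fit this model using ridge regression: $\hat{\mathbf{w}} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{I}_N)^{-1} \boldsymbol{\Phi}^T \mathbf{y}$, where the
columns of $\boldsymbol{\Phi}$ are $1$, $x_i$ and $(x - x_i)_+^3$ for $i = 2 : N - 1$ and $(x - x_i)_+$ for $i = 1$ or $i = N$.
However, we can also derive an $O(N)$ time method (Green and Silverman 1994, Sec 2.3.3).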


Regression splines 


In general, we can place the polynomials at a fixed set of $K$ locations known as knots, denoted
$\xi_k$. The result is called a regression spline. This is a parametric model, which uses basis
function expansion of the following form (where we drop the interior/exterior distinction for
simplicity):

$$ f(x) = \beta_0 + \beta_1 x + \sum_{k=1}^K \alpha_k (x - \xi_k)_+^3 \qquad (15.99) $$

Choosing the number and locations of the knots is just like choosing the number and values of
the support vectors in Section 14.3.2. If we impose an $\ell_2$ regularizer on the regression coefficients
$\alpha_j$, the method is known as penalized splines. See Section 9.6.1 for a practical example of
penalized splines.
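
As a rough illustration of Equation 15.99, the following sketch builds the truncated cubic basis and fits the coefficients by ridge regression. It is only a minimal example under stated assumptions: the function names (`spline_basis`, `fit_penalized_spline`), the synthetic sine data, the knot locations and the value of `lam` are all hypothetical choices made for illustration, not part of the text.

```python
import numpy as np

def spline_basis(x, knots):
    """Design matrix [1, x, (x - xi_k)_+^3] for the regression spline of Eq. 15.99."""
    x = np.asarray(x, dtype=float)
    trunc = np.maximum(x[:, None] - knots[None, :], 0.0) ** 3
    return np.column_stack([np.ones_like(x), x, trunc])

def fit_penalized_spline(x, y, knots, lam=1e-6):
    """Ridge fit in which only the knot coefficients alpha_k are penalized."""
    Phi = spline_basis(x, knots)
    penalty = lam * np.eye(Phi.shape[1])
    penalty[0, 0] = penalty[1, 1] = 0.0   # leave beta_0 and beta_1 unpenalized
    return np.linalg.solve(Phi.T @ Phi + penalty, Phi.T @ y)

# Synthetic 1d data (illustrative only).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(100)

knots = np.linspace(0.1, 0.9, 9)          # K = 9 fixed knot locations
w = fit_penalized_spline(x, y, knots)     # [beta_0, beta_1, alpha_1, ..., alpha_K]
y_hat = spline_basis(x, knots) @ w
rmse = np.sqrt(np.mean((y_hat - y) ** 2))
```

Leaving the intercept and slope unpenalized mirrors the idea that only the "wiggly" part of the function should be shrunk; penalizing all coefficients equally would also be a valid, simpler choice.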


The connection with GPs 


One can show (Rasmussen and Williams 2006, p139) that the cubic spline is the MAP estimate 
of the following function 


$$ f(x) = \beta_0 + \beta_1 x + r(x) \qquad (15.100) $$

where $p(\beta_j) \propto 1$ (so that we don't penalize the zero’th and first derivatives of $f$), and $r(x) \sim
\mathrm{GP}(0, \sigma_f^2 \kappa_{sp}(x, x'))$, where

$$ \kappa_{sp}(x, x') \triangleq \int_0^1 (x - u)_+ (x' - u)_+ \, du \qquad (15.101) $$


Note that the kernel in Equation 15.101 is rather unnatural, and indeed posterior samples from 
the resulting GP are rather unsmooth. However, the posterior mode/mean is smooth. This shows 
that regularizers don’t always make good priors. 
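
The integral in Equation 15.101 can be evaluated in closed form. For $x, x' \in [0, 1]$ and $v = \min(x, x')$, integrating the product of the two linear terms over $[0, v]$ gives $|x - x'| v^2 / 2 + v^3 / 3$. The sketch below checks this expression against a numerical approximation; the function names are hypothetical, and the restriction to inputs in $[0, 1]$ is an assumption.

```python
import numpy as np

def k_spline(x, xp):
    """Closed form of int_0^1 (x - u)_+ (x' - u)_+ du, assuming x, x' lie in [0, 1]."""
    v = min(x, xp)
    return abs(x - xp) * v**2 / 2 + v**3 / 3

def k_spline_numeric(x, xp, n=200_000):
    """Midpoint-rule approximation of the same integral."""
    u = (np.arange(n) + 0.5) / n
    return float(np.mean(np.maximum(x - u, 0) * np.maximum(xp - u, 0)))

closed = k_spline(0.3, 0.8)
numeric = k_spline_numeric(0.3, 0.8)
```
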
2d input (thin-plate splines) 


One can generalize cubic splines to 2d input by defining a regularizer of the following form: 


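$$ J(f) = \sum_{i=1}^N (y_i - f(\mathbf{x}_i))^2 + \lambda \iint \left[ \left( \frac{\partial^2 f}{\partial x_1^2} \right)^2 + 2 \left( \frac{\partial^2 f}{\partial x_1 \partial x_2} \right)^2 + \left( \frac{\partial^2 f}{\partial x_2^2} \right)^2 \right] dx_1 \, dx_2 \qquad (15.102) $$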


One can show that the solution has the form 


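$$ f(\mathbf{x}) = \beta_0 + \boldsymbol{\beta}_1^T \mathbf{x} + \sum_{i=1}^N \alpha_i \phi_i(\mathbf{x}) \qquad (15.103) $$

where $\phi_i(\mathbf{x}) = \eta(||\mathbf{x} - \mathbf{x}_i||)$, and $\eta(z) = z^2 \log z^2$. This is known as a thin plate spline. This
is equivalent to MAP estimation with a GP whose kernel is defined in (Williams and Fitzgibbon
2006).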


Higher-dimensional inputs 


It is hard to analytically solve for the form of the optimal solution when using higher-order 
inputs. However, in the parametric regression spline setting, where we forego the regularizer on 
f, we have more freedom in defining our basis functions. One way to handle multiple inputs is 
to use a tensor product basis, defined as the cross product of 1d basis functions. For example, 
for 2d input, we can define 


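$$ f(x_1, x_2) = \beta_0 + \sum_m \beta_{1m} (x_1 - \xi_{1m})_+ + \sum_m \beta_{2m} (x_2 - \xi_{2m})_+ \qquad (15.104) $$
$$ \qquad + \sum_m \beta_{12m} (x_1 - \xi_{1m})_+ (x_2 - \xi_{2m})_+ \qquad (15.105) $$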


It is clear that for high-dimensional data, we cannot allow higher-order interactions, because 
there will be too many parameters to fit. One approach to this problem is to use a search 
procedure to look for useful interaction terms. This is known as MARS, which stands for 
“multivariate adaptive regression splines”. See Section 16.3.3 for details. 


RKHS methods compared to GPs * 


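We can generalize the idea of penalizing derivatives of functions, as used in smoothing splines,
to fit functions with a more general notion of smoothness. Recall from Section 14.2.3 that
Mercer’s theorem says that any positive definite kernel function can be represented in terms of
eigenfunctions:

$$ \kappa(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^\infty \lambda_i \phi_i(\mathbf{x}) \phi_i(\mathbf{x}') \qquad (15.106) $$

The $\phi_i$ form an orthonormal basis for a function space:

$$ \mathcal{H}_k = \left\{ f : f(\mathbf{x}) = \sum_{i=1}^\infty f_i \phi_i(\mathbf{x}), \ \sum_{i=1}^\infty f_i^2 / \lambda_i < \infty \right\} \qquad (15.107) $$

Now define the inner product between two functions $f(\mathbf{x}) = \sum_{i=1}^\infty f_i \phi_i(\mathbf{x})$ and $g(\mathbf{x}) =
\sum_{i=1}^\infty g_i \phi_i(\mathbf{x})$ in this space as follows:

$$ \langle f, g \rangle_{\mathcal{H}} = \sum_{i=1}^\infty \frac{f_i g_i}{\lambda_i} \qquad (15.108) $$

In Exercise 15.1, we show that this definition implies that

$$ \langle \kappa(\mathbf{x}_1, \cdot), \kappa(\mathbf{x}_2, \cdot) \rangle_{\mathcal{H}} = \kappa(\mathbf{x}_1, \mathbf{x}_2) \qquad (15.109) $$

This is called the reproducing property, and the space of functions $\mathcal{H}_k$ is called a reproducing
kernel Hilbert space or RKHS.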
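Now consider an optimization problem of the form

$$ J(f) = \frac{1}{2\sigma_y^2} \sum_{i=1}^N (y_i - f(\mathbf{x}_i))^2 + \frac{1}{2} ||f||_{\mathcal{H}}^2 \qquad (15.110) $$

where $||f||_{\mathcal{H}}$ is the norm of a function, defined by

$$ ||f||_{\mathcal{H}}^2 = \langle f, f \rangle_{\mathcal{H}} = \sum_{i=1}^\infty \frac{f_i^2}{\lambda_i} \qquad (15.111) $$

The intuition is that functions that are complex wrt the kernel will have large norms, because
they will need many eigenfunctions to represent them. We want to pick a simple function that
provides a good fit to the data.

One can show (see e.g., (Schoelkopf and Smola 2002)) that the solution must have the form

$$ f(\mathbf{x}) = \sum_{i=1}^N \alpha_i \kappa(\mathbf{x}, \mathbf{x}_i) \qquad (15.112) $$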


This is known as the representer theorem, and holds for other convex loss functions besides 
squared error. 

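We can solve for the $\boldsymbol{\alpha}$ by substituting in $f(\mathbf{x}) = \sum_{i=1}^N \alpha_i \kappa(\mathbf{x}, \mathbf{x}_i)$ and using the reproducing
property to get

$$ J(\boldsymbol{\alpha}) = \frac{1}{2\sigma_y^2} |\mathbf{y} - \mathbf{K} \boldsymbol{\alpha}|^2 + \frac{1}{2} \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha} \qquad (15.113) $$

Minimizing wrt $\boldsymbol{\alpha}$ we find

$$ \hat{\boldsymbol{\alpha}} = (\mathbf{K} + \sigma_y^2 \mathbf{I})^{-1} \mathbf{y} \qquad (15.114) $$

and hence

$$ \hat{f}(\mathbf{x}_*) = \sum_i \hat{\alpha}_i \kappa(\mathbf{x}_*, \mathbf{x}_i) = \mathbf{k}_*^T (\mathbf{K} + \sigma_y^2 \mathbf{I})^{-1} \mathbf{y} \qquad (15.115) $$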


This is identical to Equation 15.18, the posterior mean of a GP predictive distribution. Indeed,
since the mean and mode of a Gaussian are the same, we can see that linear regression with an
RKHS regularizer is equivalent to MAP estimation with a GP. An analogous statement holds for
the GP logistic regression case, which also uses a convex likelihood / loss function.
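
The closed-form solution in Equations 15.114 and 15.115 is short enough to sketch directly. The example below is a minimal illustration under assumptions: it uses a squared-exponential kernel with a hypothetical length scale `ell`, synthetic 1d data, and helper names (`rbf_kernel`, `fit_alpha`, `predict`) that do not come from the text. It also checks numerically that the fitted coefficients are a stationary point of the objective in Equation 15.113.

```python
import numpy as np

def rbf_kernel(A, B, ell=1.0):
    """Squared-exponential kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / ell**2)

def fit_alpha(K, y, sigma2):
    """alpha_hat = (K + sigma_y^2 I)^{-1} y  (Eq. 15.114)."""
    return np.linalg.solve(K + sigma2 * np.eye(len(y)), y)

def predict(X_star, X, alpha, ell=1.0):
    """f_hat(x_*) = sum_i alpha_i kappa(x_*, x_i)  (Eq. 15.115)."""
    return rbf_kernel(X_star, X, ell) @ alpha

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
sigma2 = 0.01

K = rbf_kernel(X, X)
alpha = fit_alpha(K, y, sigma2)
f_star = predict(np.array([[0.0], [1.5]]), X, alpha)

# Gradient of J(alpha) in Eq. 15.113; it should vanish at the solution.
grad = -K @ (y - K @ alpha) / sigma2 + K @ alpha
```

The same numbers would be obtained as the posterior mean of a GP regression model with this kernel and noise variance, which is the equivalence described above.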


GP latent variable model 


In Section 14.4.4, we discussed kernel PCA, which applies the kernel trick to regular PCA. In 
this section, we discuss a different way to combine kernels with probabilistic PCA. The resulting 
method is known as the GP-LVM, which stands for “Gaussian process latent variable model” 
(Lawrence 2005). 

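To explain the method, we start with PPCA. Recall from Section 12.2.4 that the PPCA model is
as follows:

$$ p(\mathbf{z}_i) = \mathcal{N}(\mathbf{z}_i | \mathbf{0}, \mathbf{I}) \qquad (15.116) $$
$$ p(\mathbf{y}_i | \mathbf{z}_i, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{y}_i | \mathbf{W} \mathbf{z}_i, \sigma^2 \mathbf{I}) \qquad (15.117) $$

We can fit this model by maximum likelihood, by integrating out the $\mathbf{z}_i$ and maximizing with
respect to $\mathbf{W}$ (and $\sigma^2$). The objective is given by

$$ p(\mathbf{Y} | \mathbf{W}, \sigma^2) = (2\pi)^{-DN/2} |\mathbf{C}|^{-N/2} \exp\left( -\frac{1}{2} \mathrm{tr}(\mathbf{C}^{-1} \mathbf{Y}^T \mathbf{Y}) \right) \qquad (15.118) $$

where $\mathbf{C} = \mathbf{W} \mathbf{W}^T + \sigma^2 \mathbf{I}$. As we showed in Theorem 12.2.2, the MLE for this can be computed
in terms of the eigenvectors of $\mathbf{Y}^T \mathbf{Y}$.

Now we consider the dual problem, whereby we maximize with respect to $\mathbf{Z}$ and integrate out $\mathbf{W}$. We will
use a prior of the form $p(\mathbf{W}) = \prod_j \mathcal{N}(\mathbf{w}_j | \mathbf{0}, \mathbf{I})$. The corresponding likelihood becomes

$$ p(\mathbf{Y} | \mathbf{Z}, \sigma^2) = \prod_{d=1}^D \mathcal{N}(\mathbf{y}_{:,d} | \mathbf{0}, \mathbf{Z} \mathbf{Z}^T + \sigma^2 \mathbf{I}) \qquad (15.119) $$
$$ \qquad = (2\pi)^{-DN/2} |\mathbf{K}_z|^{-D/2} \exp\left( -\frac{1}{2} \mathrm{tr}(\mathbf{K}_z^{-1} \mathbf{Y} \mathbf{Y}^T) \right) \qquad (15.120) $$

where $\mathbf{K}_z = \mathbf{Z} \mathbf{Z}^T + \sigma^2 \mathbf{I}$. Based on our discussion of the connection between the eigenvalues
of $\mathbf{Y} \mathbf{Y}^T$ and of $\mathbf{Y}^T \mathbf{Y}$ in Section 14.4.4, it should come as no surprise that we can also solve
the dual problem using eigenvalue methods (see (Lawrence 2005) for the details).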

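If we use a linear kernel, we recover PCA. But we can also use a more general kernel:
$\mathbf{K}_z = \mathbf{K} + \sigma^2 \mathbf{I}$, where $\mathbf{K}$ is the Gram matrix for $\mathbf{Z}$. The MLE for $\mathbf{Z}$ will no longer be available
via eigenvalue methods; instead we must use gradient-based optimization. The objective is given
by

$$ \ell = -\frac{D}{2} \log |\mathbf{K}_z| - \frac{1}{2} \mathrm{tr}(\mathbf{K}_z^{-1} \mathbf{Y} \mathbf{Y}^T) \qquad (15.121) $$

and the gradient is given by

$$ \frac{\partial \ell}{\partial Z_{ij}} = \frac{\partial \ell}{\partial \mathbf{K}_z} \frac{\partial \mathbf{K}_z}{\partial Z_{ij}} \qquad (15.122) $$

where

$$ \frac{\partial \ell}{\partial \mathbf{K}_z} = \frac{1}{2} \left( \mathbf{K}_z^{-1} \mathbf{Y} \mathbf{Y}^T \mathbf{K}_z^{-1} - D \mathbf{K}_z^{-1} \right) \qquad (15.123) $$

(The overall factor of 1/2 follows from differentiating Equation 15.121; it only rescales the gradient and is sometimes omitted.)
The form of $\partial \mathbf{K}_z / \partial Z_{ij}$ will of course depend on the kernel used. (For example, with a linear kernel,
where $\mathbf{K}_z = \mathbf{Z} \mathbf{Z}^T + \sigma^2 \mathbf{I}$, we have $\partial \mathbf{K}_z / \partial \mathbf{Z} = \mathbf{Z}$.) We can then pass this gradient to any standard
optimizer, such as conjugate gradient descent.

Figure 15.10 2d representation of 12 dimensional oil flow data. The different colors/symbols represent
the 3 phases of oil flow. (a) Kernel PCA with Gaussian kernel. (b) GP-LVM with Gaussian kernel. The
shading represents the precision of the posterior, where lighter pixels have higher precision. From Figure 1
of (Lawrence 2005). Used with kind permission of Neil Lawrence.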
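
To make the gradient computation concrete, here is a minimal sketch for the linear-kernel case, $\mathbf{K}_z = \mathbf{Z} \mathbf{Z}^T + \sigma^2 \mathbf{I}$. Differentiating Equation 15.121 directly gives $\partial \ell / \partial \mathbf{K}_z = \frac{1}{2} (\mathbf{K}_z^{-1} \mathbf{Y} \mathbf{Y}^T \mathbf{K}_z^{-1} - D \mathbf{K}_z^{-1})$, and because this matrix is symmetric the chain rule yields $\partial \ell / \partial \mathbf{Z} = 2 (\partial \ell / \partial \mathbf{K}_z) \mathbf{Z}$. The function names, the random data and the dimensions are assumptions for illustration; the code compares the analytic gradient with finite differences rather than running a full optimizer.

```python
import numpy as np

def gplvm_objective(Z, Y, sigma2):
    """Log likelihood of Eq. 15.121 (additive constants dropped), linear kernel."""
    N, D = Y.shape
    Kz = Z @ Z.T + sigma2 * np.eye(N)
    _, logdet = np.linalg.slogdet(Kz)
    return -0.5 * D * logdet - 0.5 * np.trace(np.linalg.solve(Kz, Y @ Y.T))

def gplvm_gradient(Z, Y, sigma2):
    """Gradient of the objective with respect to Z for the linear kernel."""
    N, D = Y.shape
    Kz = Z @ Z.T + sigma2 * np.eye(N)
    Kinv = np.linalg.inv(Kz)
    dl_dK = 0.5 * (Kinv @ Y @ Y.T @ Kinv - D * Kinv)
    # K_z = Z Z^T + sigma^2 I and dl_dK is symmetric, so dl/dZ = 2 * dl_dK @ Z.
    return 2.0 * dl_dK @ Z

# Random data (illustrative only): N points, D observed dims, L latent dims.
rng = np.random.default_rng(0)
N, D, L = 8, 5, 2
Y = rng.standard_normal((N, D))
Z = rng.standard_normal((N, L))
sigma2 = 0.5

# Central finite-difference check of the analytic gradient.
eps = 1e-6
numeric = np.zeros_like(Z)
for i in range(N):
    for j in range(L):
        E = np.zeros_like(Z)
        E[i, j] = eps
        numeric[i, j] = (gplvm_objective(Z + E, Y, sigma2)
                         - gplvm_objective(Z - E, Y, sigma2)) / (2 * eps)
max_err = np.max(np.abs(numeric - gplvm_gradient(Z, Y, sigma2)))
```

For a nonlinear kernel only the last step changes: `dl_dK` is the same, but it must be combined with the derivative of the Gram matrix with respect to each latent coordinate.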

Let us now compare GP-LVM to kernel PCA. In kPCA, we learn a kernelized mapping from
the observed space to the latent space, whereas in GP-LVM, we learn a kernelized mapping from
the latent space to the observed space. Figure 15.10 illustrates the results of applying kPCA and
GP-LVM to visualize the 12 dimensional oil flow data shown in Figure 14.9(a). We see that the
embedding produced by GP-LVM is far better. If we perform nearest neighbor classification in
the latent space, GP-LVM makes 4 errors, while kernel PCA (with the same kernel but separately
optimized hyper-parameters) makes 13 errors, and regular PCA makes 20 errors.

GP-LVM inherits the usual advantages of probabilistic generative models, such as the ability
to handle missing data and data of different types, the ability to use gradient-based methods
(instead of grid search) to tune the kernel parameters, the ability to handle prior information,
etc. For a discussion of some other probabilistic methods for (spectral) dimensionality reduction,
see (Lawrence 2012).


Approximation methods for large datasets 


The principal drawback of GPs is that they take $O(N^3)$ time to use. This is because of the
need to invert (or compute the Cholesky decomposition of) the $N \times N$ kernel matrix $\mathbf{K}$. A
variety of approximation methods have been devised which take $O(M^2 N)$ time, where $M$ is a
user-specifiable parameter. For details, see (Quinonero-Candela et al. 2007).


Exercises 


Exercise 15.1 Reproducing property 
Prove Equation 15.109. 


Adaptive basis function models 


Introduction 


In Chapters 14 and 15, we discussed kernel methods, which provide a powerful way to create
non-linear models for regression and classification. The prediction takes the form $f(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$,
where we define

$$ \boldsymbol{\phi}(\mathbf{x}) = [\kappa(\mathbf{x}, \boldsymbol{\mu}_1), \ldots, \kappa(\mathbf{x}, \boldsymbol{\mu}_N)] \qquad (16.1) $$

and where $\boldsymbol{\mu}_k$ are either all the training data or some subset. Models of this form essentially
perform a form of template matching, whereby they compare the input $\mathbf{x}$ to the stored
prototypes $\boldsymbol{\mu}_k$.

Although this can work well, it relies on having a good kernel function to measure the 
similarity between data vectors. Often coming up with a good kernel function is quite difficult. 
For example, how do we define the similarity between two images? Pixel-wise comparison of 
intensities (which is what a Gaussian kernel corresponds to) does not work well. Although it is 
possible (and indeed common) to hand-engineer kernels for specific tasks (see e.g., the pyramid 
match kernel in Section 14.2.7), it would be more interesting if we could learn the kernel. 

In Section 15.2.4, we discussed a way to learn the parameters of a kernel function, by maximizing
the marginal likelihood. For example, if we use the ARD kernel,

$$ \kappa(\mathbf{x}, \mathbf{x}') = \theta_0 \exp\left( -\frac{1}{2} \sum_{j=1}^D \theta_j (x_j - x_j')^2 \right) \qquad (16.2) $$

we can estimate the $\theta_j$, and thus perform a form of nonlinear feature selection. However,
such methods can be computationally expensive. Another approach, known as multiple kernel
learning (see e.g., (Rakotomamonjy et al. 2008)) uses a convex combination of base kernels,
$\kappa(\mathbf{x}, \mathbf{x}') = \sum_j w_j \kappa_j(\mathbf{x}, \mathbf{x}')$, and then estimates the mixing weights $w_j$. But this relies on
having good base kernels (and is also computationally expensive).
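
The feature-selection effect of the ARD kernel in Equation 16.2 can be seen in a few lines of code: if $\theta_j = 0$, feature $j$ has no influence on the kernel value. The sketch below is illustrative only; the function name `ard_kernel` and the random inputs are assumptions, and the hyper-parameters are set by hand rather than estimated from the marginal likelihood.

```python
import numpy as np

def ard_kernel(X1, X2, theta0, theta):
    """ARD kernel of Eq. 16.2: theta0 * exp(-0.5 * sum_j theta_j (x_j - x'_j)^2)."""
    diff = X1[:, None, :] - X2[None, :, :]            # shape (N1, N2, D)
    return theta0 * np.exp(-0.5 * np.sum(theta * diff**2, axis=2))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
theta = np.array([1.0, 0.0])                          # second feature switched off

K = ard_kernel(X, X, theta0=2.0, theta=theta)

# Perturbing the switched-off feature leaves the kernel matrix unchanged.
X_shift = X.copy()
X_shift[:, 1] += rng.standard_normal(5)
K_shift = ard_kernel(X_shift, X_shift, theta0=2.0, theta=theta)
```
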

An alternative approach is to dispense with kernels altogether, and try to learn useful features
$\boldsymbol{\phi}(\mathbf{x})$ directly from the input data. That is, we will create what we call an adaptive
basis-function model (ABM), which is a model of the form

$$ f(\mathbf{x}) = w_0 + \sum_{m=1}^M w_m \phi_m(\mathbf{x}) \qquad (16.3) $$

where $\phi_m(\mathbf{x})$ is the m’th basis function, which is learned from data. This framework covers all
of the models we will discuss in this chapter.

Typically the basis functions are parametric, so we can write $\phi_m(\mathbf{x}) = \phi(\mathbf{x}; \mathbf{v}_m)$, where $\mathbf{v}_m$
are the parameters of the basis function itself. We will use $\boldsymbol{\theta} = (w_0, \mathbf{w}_{1:M}, \{\mathbf{v}_m\}_{m=1}^M)$ to
denote the entire parameter set. The resulting model is not linear-in-the-parameters anymore,
so we will only be able to compute a locally optimal MLE or MAP estimate of $\boldsymbol{\theta}$. Nevertheless,
such models often significantly outperform linear models, as we will see.


Classification and regression trees (CART) 


Classification and regression trees or CART models, also called decision trees (not to be 
confused with the decision trees used in decision theory) are defined by recursively partitioning 
the input space, and defining a local model in each resulting region of input space. This can be 
represented by a tree, with one leaf per region, as we explain below. 


Basics 


To explain the CART approach, consider the tree in Figure 16.1(a). The first node asks if $x_1$ is
less than some threshold $t_1$. If yes, we then ask if $x_2$ is less than some other threshold $t_2$. If
yes, we are in the bottom left quadrant of space, $R_1$. If no, we ask if $x_1$ is less than $t_3$. And
so on. The result of these axis parallel splits is to partition 2d space into 5 regions, as shown
in Figure 16.1(b). We can now associate a mean response with each of these regions, resulting in
the piecewise constant surface shown in Figure 16.1(c).

We can write the model in the following form

$$ f(\mathbf{x}) = \mathbb{E}[y | \mathbf{x}] = \sum_{m=1}^M w_m \mathbb{I}(\mathbf{x} \in R_m) = \sum_{m=1}^M w_m \phi(\mathbf{x}; \mathbf{v}_m) \qquad (16.4) $$

where $R_m$ is the m’th region, $w_m$ is the mean response in this region, and $\mathbf{v}_m$ encodes the
choice of variable to split on, and the threshold value, on the path from the root to the m’th leaf.
This makes it clear that a CART model is just an adaptive basis-function model, where the
basis functions define the regions, and the weights specify the response value in each region.
We discuss how to find these basis functions below.

We can generalize this to the classification setting by storing the distribution over class labels
in each leaf, instead of the mean response. This is illustrated in Figure 16.2. This model can
be used to classify the data in Figure 1.1. For example, we first check the color of the object.
If it is blue, we follow the left branch and end up in a leaf labeled “4,0”, which means we
have 4 positive examples and 0 negative examples which match this criterion. Hence we predict
$p(y = 1 | \mathbf{x}) = 4/4$ if $\mathbf{x}$ is blue. If it is red, we then check the shape: if it is an ellipse, we
end up in a leaf labeled “1,1”, so we predict $p(y = 1 | \mathbf{x}) = 1/2$. If it is red but not an ellipse,
we predict $p(y = 1 | \mathbf{x}) = 0/2$. If it is some other colour, we check the size: if less than 10,
we predict $p(y = 1 | \mathbf{x}) = 4/4$, otherwise $p(y = 1 | \mathbf{x}) = 0/5$. These probabilities are just the
empirical fraction of positive examples that satisfy each conjunction of feature values, which
defines a path from the root to a leaf.

Figure 16.1 A simple regression tree on two inputs. Based on Figure 9.2 of (Hastie et al. 2009). Figure
generated by regtreeSurfaceDemo.

Figure 16.2 A simple decision tree for the data in Figure 1.1. A leaf labeled as $(n_1, n_0)$ means that
there are $n_1$ positive examples that match this path, and $n_0$ negative examples. In this tree, most of
the leaves are “pure”, meaning they only have examples of one class or the other; the only exception is
the leaf representing red ellipses, which has a label distribution of (1,1). We could distinguish positive from
negative red ellipses by adding a further test based on size. However, it is not always desirable to construct
trees that perfectly model the training data, due to overfitting.


Growing a tree 


Finding the optimal partitioning of the data is NP-complete (Hyafil and Rivest 1976), so it is
common to use the greedy procedure shown in Algorithm 16.1 to compute a locally optimal MLE.
This method is used by CART (Breiman et al. 1984), C4.5 (Quinlan 1993), and ID3 (Quinlan 1986),
which are three popular implementations of the method. (See dtfit for a simple Matlab
implementation.)

Algorithm 16.1: Recursive procedure to grow a classification/regression tree

function fitTree(node, D, depth)
    node.prediction = mean(y_i : i ∈ D)      // or class label distribution
    (j*, t*, D_L, D_R) = split(D)
    if not worthSplitting(depth, cost, D_L, D_R) then
        return node
    else
        node.test = λx. x_{j*} < t*          // anonymous function
        node.left = fitTree(node, D_L, depth+1)
        node.right = fitTree(node, D_R, depth+1)
        return node

The split function chooses the best feature, and the best value for that feature, as follows:

$$ (j^*, t^*) = \arg \min_{j \in \{1, \ldots, D\}} \min_{t \in \mathcal{T}_j} \ \mathrm{cost}(\{\mathbf{x}_i, y_i : x_{ij} \le t\}) + \mathrm{cost}(\{\mathbf{x}_i, y_i : x_{ij} > t\}) \qquad (16.5) $$


where the cost function for a given dataset will be defined below. For notational simplicity, we
have assumed all inputs are real-valued or ordinal, so it makes sense to compare a feature $x_{ij}$
to a numeric value $t$. The set of possible thresholds $\mathcal{T}_j$ for feature $j$ can be obtained by sorting
the unique values of $x_{ij}$. For example, if feature 1 has the values $\{4.5, -12, 72, -12\}$, then we
set $\mathcal{T}_1 = \{-12, 4.5, 72\}$. In the case of categorical inputs, the most common approach is to
consider splits of the form $x_{ij} = c_k$ and $x_{ij} \ne c_k$, for each possible class label $c_k$. Although
we could allow for multi-way splits (resulting in non-binary trees), this would result in data
fragmentation, meaning too little data might “fall” into each subtree, resulting in overfitting.

The function that checks if a node is worth splitting can use several stopping heuristics, such 
as the following: 


• is the reduction in cost too small? Typically we define the gain of using a feature to be a
normalized measure of the reduction in cost:

$$ \Delta \triangleq \mathrm{cost}(\mathcal{D}) - \left( \frac{|\mathcal{D}_L|}{|\mathcal{D}|} \mathrm{cost}(\mathcal{D}_L) + \frac{|\mathcal{D}_R|}{|\mathcal{D}|} \mathrm{cost}(\mathcal{D}_R) \right) \qquad (16.6) $$

• has the tree exceeded the maximum desired depth?

• is the distribution of the response in either $\mathcal{D}_L$ or $\mathcal{D}_R$ sufficiently homogeneous (e.g., all
labels are the same, so the distribution is pure)?

• is the number of examples in either $\mathcal{D}_L$ or $\mathcal{D}_R$ too small?


All that remains is to specify the cost measure used to evaluate the quality of a proposed 
split. This depends on whether our goal is regression or classification. We discuss both cases 
below. 


Regression cost 


In the regression setting, we define the cost as follows: 


$$ \mathrm{cost}(\mathcal{D}) = \sum_{i \in \mathcal{D}} (y_i - \bar{y})^2 \qquad (16.7) $$

where $\bar{y} = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} y_i$ is the mean of the response variable in the specified set of data.
Alternatively, we can fit a linear regression model for each leaf, using as inputs the features that 
were chosen on the path from the root, and then measure the residual error. 
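
The greedy procedure, combined with the squared-error cost of Equation 16.7, can be sketched as follows. This is a minimal illustration, not the dtfit implementation: the function names, the dictionary-based node representation, the stopping parameters (`max_depth`, `min_leaf`, `min_gain`) and the toy data are all assumptions. For simplicity the stopping test uses the raw reduction in cost rather than a normalized gain, and the split test uses $x_{ij} \le t$ as in Equation 16.5.

```python
import numpy as np

def cost(y):
    """Regression cost of Eq. 16.7: sum of squared deviations from the mean."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def split(X, y):
    """Exhaustive search over features and thresholds (Eq. 16.5)."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        # Skip the largest value: it would leave the right child empty.
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            c = cost(y[left]) + cost(y[~left])
            if c < best[2]:
                best = (j, t, c)
    return best

def fit_tree(X, y, depth=0, max_depth=3, min_leaf=2, min_gain=1e-12):
    """Recursively grow a tree; each node is a dict, leaves have no 'feature' key."""
    node = {"prediction": float(y.mean())}
    j, t, c = split(X, y)
    if j is None or depth >= max_depth or cost(y) - c <= min_gain:
        return node
    left = X[:, j] <= t
    if left.sum() < min_leaf or (~left).sum() < min_leaf:
        return node
    node.update(
        feature=j, threshold=t,
        left=fit_tree(X[left], y[left], depth + 1, max_depth, min_leaf, min_gain),
        right=fit_tree(X[~left], y[~left], depth + 1, max_depth, min_leaf, min_gain))
    return node

def predict_one(node, x):
    """Follow the tests from the root to a leaf and return its mean response."""
    while "feature" in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["prediction"]

# Toy step-function data (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
tree = fit_tree(X, y)
```

On this toy data a single split at $x \le 2$ already makes both children pure, so the recursion stops after one level.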


Classification cost 


In the classification setting, there are several ways to measure the quality of a split. First, we
fit a multinoulli model to the data in the leaf satisfying the test $X_j < t$ by estimating the
class-conditional probabilities as follows:

$$ \hat{\pi}_c = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \mathbb{I}(y_i = c) \qquad (16.8) $$

where $\mathcal{D}$ is the data in the leaf. Given this, there are several common error measures for
evaluating a proposed partition:


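• Misclassification rate. We define the most probable class label as $\hat{y} = \arg\max_c \hat{\pi}_c$. The
corresponding error rate is then

$$ \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \mathbb{I}(y_i \ne \hat{y}) = 1 - \hat{\pi}_{\hat{y}} \qquad (16.9) $$

• Entropy, or deviance:

$$ \mathbb{H}(\hat{\boldsymbol{\pi}}) = -\sum_{c=1}^C \hat{\pi}_c \log \hat{\pi}_c \qquad (16.10) $$

Note that minimizing the entropy is equivalent to maximizing the information gain (Quinlan
1986) between test $X_j < t$ and the class label $Y$, defined by

$$ \mathrm{infoGain}(X_j < t, Y) \triangleq \mathbb{H}(Y) - \mathbb{H}(Y | X_j < t) \qquad (16.11) $$
$$ \qquad = \left( -\sum_c p(y = c) \log p(y = c) \right) \qquad (16.12) $$
$$ \qquad + \left( \sum_c p(y = c | X_j < t) \log p(c | X_j < t) \right) \qquad (16.13) $$

since $\hat{\pi}_c$ is an MLE for the distribution $p(c | X_j < t)$.¹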


1. If $X_j$ is categorical, and we use tests of the form $X_j = k$, then taking expectations over values of $X_j$ gives
the mutual information between $X_j$ and $Y$: $\mathbb{E}[\mathrm{infoGain}(X_j, Y)] = \sum_k p(X_j = k) \, \mathrm{infoGain}(X_j = k, Y) =
\mathbb{H}(Y) - \mathbb{H}(Y | X_j) = \mathbb{I}(Y; X_j)$.

Figure 16.3 Node impurity measures for binary classification. The horizontal axis corresponds to $p$, the
probability of class 1. The entropy measure has been rescaled to pass through (0.5,0.5). Based on Figure
9.3 of (Hastie et al. 2009). Figure generated by giniDemo.


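• Gini index

$$ \sum_{c=1}^C \hat{\pi}_c (1 - \hat{\pi}_c) = \sum_c \hat{\pi}_c - \sum_c \hat{\pi}_c^2 = 1 - \sum_c \hat{\pi}_c^2 \qquad (16.14) $$

This is the expected error rate. To see this, note that $\hat{\pi}_c$ is the probability a random entry in
the leaf belongs to class $c$, and $(1 - \hat{\pi}_c)$ is the probability it would be misclassified.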


In the two-class case, where $p = \pi_m(1)$, the misclassification rate is $1 - \max(p, 1 - p)$, the
entropy is $\mathbb{H}_2(p)$, and the Gini index is $2p(1 - p)$. These are plotted in Figure 16.3. We see
that the cross-entropy and Gini measures are very similar, and are more sensitive to changes in
class probability than is the misclassification rate. For example, consider a two-class problem
with 400 cases in each class. Suppose one split created the nodes (300,100) and (100,300), while
the other created the nodes (200,400) and (200,0). Both splits produce a misclassification rate of
0.25. However, the latter seems preferable, since one of the nodes is pure, i.e., it only contains
one class. The cross-entropy and Gini measures will favor this latter choice.
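
The numbers in this example are easy to verify. The sketch below computes the three impurity measures for both candidate splits, averaging over the child nodes with weights proportional to node size. The function names are hypothetical, and the entropy is measured in bits (base-2 logarithm), which is an assumption; a different base only rescales the values.

```python
import numpy as np

def class_probs(counts):
    """Empirical class distribution pi_hat of Eq. 16.8 from per-class counts."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def misclassification(counts):
    return float(1.0 - class_probs(counts).max())          # Eq. 16.9

def entropy(counts):
    p = class_probs(counts)
    p = p[p > 0]                                            # convention: 0 log 0 = 0
    return float(-(p * np.log2(p)).sum())                   # Eq. 16.10, in bits

def gini(counts):
    p = class_probs(counts)
    return float(1.0 - (p ** 2).sum())                      # Eq. 16.14

def split_cost(nodes, measure):
    """Average impurity of the child nodes, weighted by node size."""
    sizes = np.array([sum(n) for n in nodes], dtype=float)
    return float(sum(s / sizes.sum() * measure(n) for s, n in zip(sizes, nodes)))

split_a = [(300, 100), (100, 300)]
split_b = [(200, 400), (200, 0)]

for name, m in [("misclassification", misclassification),
                ("entropy", entropy), ("gini", gini)]:
    print(name, round(split_cost(split_a, m), 3), round(split_cost(split_b, m), 3))
# misclassification 0.25 0.25
# entropy 0.811 0.689
# gini 0.375 0.333
```

Both splits tie on misclassification rate, while entropy and the Gini index are lower for the second split, which contains the pure node.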


Example 


As an example, consider two of the four features from the 3-class iris dataset, shown in
Figure 16.4(a). The resulting tree is shown in Figure 16.5(a), and the decision boundaries are shown
in Figure 16.4(b). We see that the tree is quite complex, as are the resulting decision boundaries.
In Figure 16.5(b), we show that the CV estimate of the error is much higher than the training set
error, indicating overfitting. Below we discuss how to perform a tree-pruning stage to simplify
the tree.




Figure 16.4 (a) Iris data. We only show the first two features, sepal length and sepal width, and ignore
petal length and petal width. (b) Decision boundaries induced by the decision tree in Figure 16.5(a).

Figure 16.5 (a) Unpruned decision tree for Iris data. (b) Plot of misclassification error rate vs depth of
tree. Figure generated by dtreeDemoIris.


Pruning a tree 


To prevent overfitting, we can stop growing the tree if the decrease in the error is not sufficient
to justify the extra complexity of adding an extra subtree. However, this tends to be too myopic.
For example, on the xor data in Figure 14.2(c), it might never make any splits, since each
feature on its own has little predictive power.

The standard approach is therefore to grow a “full” tree, and then to perform pruning. This 
can be done using a scheme that prunes the branches giving the least increase in the error. See 
(Breiman et al. 1984) for details. 

To determine how far to prune back, we can evaluate the cross-validated error on each such
subtree, and then pick the tree whose CV error is within 1 standard error of the minimum. This
is illustrated in Figure 16.5(b). The point with the minimum CV error corresponds to the simple
tree in Figure 16.6(a).
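
One common reading of this rule is: among all candidate subtrees whose CV error is at most the minimum CV error plus one standard error, choose the simplest. The sketch below implements that reading. The function name and the CV numbers are hypothetical, made up purely to show the mechanics; they are not results from the iris experiment.

```python
import numpy as np

def one_se_rule(sizes, cv_mean, cv_se):
    """Return the smallest tree size whose CV error is within 1 SE of the minimum."""
    sizes, cv_mean, cv_se = map(np.asarray, (sizes, cv_mean, cv_se))
    best = np.argmin(cv_mean)
    threshold = cv_mean[best] + cv_se[best]
    eligible = np.flatnonzero(cv_mean <= threshold)
    return int(sizes[eligible[np.argmin(sizes[eligible])]])

# Hypothetical CV results for a sequence of pruned subtrees (number of leaves).
sizes = [1, 2, 3, 5, 8, 12]
cv_mean = [0.60, 0.35, 0.27, 0.25, 0.26, 0.30]
cv_se = [0.03, 0.03, 0.03, 0.03, 0.03, 0.03]

chosen = one_se_rule(sizes, cv_mean, cv_se)
```

Here the minimum CV error (0.25) occurs at 5 leaves, but the 3-leaf tree has an error of 0.27, which is within one standard error, so the simpler tree is selected.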

Figure 16.6 Pruned decision tree for Iris data. Figure generated by dtreeDemoIris.


Pros and cons of trees 


CART models are popular for several reasons: they are easy to interpret², they can easily handle
mixed discrete and continuous inputs, they are insensitive to monotone transformations of the
inputs (because the split points are based on ranking the data points), they perform automatic
variable selection, they are relatively robust to outliers, they scale well to large data sets, and
they can be modified to handle missing inputs.³

However, CART models also have some disadvantages. The primary one is that they do 
not predict very accurately compared to other kinds of model. This is in part due to the 
greedy nature of the tree construction algorithm. A related problem is that trees are unstable: 
small changes to the input data can have large effects on the structure of the tree, due to the 
hierarchical nature of the tree-growing process, causing errors at the top to affect the rest of the 
tree. In frequentist terminology, we say that trees are high variance estimators. We discuss a 
solution to this below. 


2. We can postprocess the tree to derive a series of logical rules such as “If $x_1 < 5.45$ then ...” (Quinlan 1990).

3. The standard heuristic for handling missing inputs in decision trees is to look for a series of “backup” variables,
which can induce a similar partition to the chosen variable at any given split; these can be used in case the chosen
variable is unobserved at test time. These are called surrogate splits. This method finds highly correlated features,
and can be thought of as learning a local joint model of the input. This has the advantage over a generative model
of not modeling the entire joint distribution of inputs, but it has the disadvantage of being entirely ad hoc. A simpler
approach, applicable to categorical variables, is to code “missing” as a new value, and then to treat the data as fully
observed.


Random forests


One way to reduce the variance of an estimate is to average together many estimates. For
example, we can train $M$ different trees on different subsets of the data, chosen randomly with
replacement, and then compute the ensemble

$$ f(\mathbf{x}) = \sum_{m=1}^M \frac{1}{M} f_m(\mathbf{x}) \qquad (16.15) $$

where $f_m$ is the m’th tree. This technique is called bagging (Breiman 1996), which stands for
“bootstrap aggregating”.

Unfortunately, simply re-running the same learning algorithm on different subsets of the data
can result in highly correlated predictors, which limits the amount of variance reduction that is
possible. The technique known as random forests (Breiman 2001a) tries to decorrelate the base
learners by learning trees based on a randomly chosen subset of input variables, as well as a
randomly chosen subset of data cases. Such models often have very good predictive accuracy
(Caruana and Niculescu-Mizil 2006), and have been widely used in many applications (e.g., for
body pose recognition using Microsoft’s popular Kinect sensor (Shotton et al. 2011)).

Bagging is a frequentist concept. It is also possible to adopt a Bayesian approach to learning
trees. In particular, (Chipman et al. 1998; Denison et al. 1998; Wu et al. 2007) perform approximate
inference over the space of trees (structure and parameters) using MCMC. This reduces the
variance of the predictions. We can also perform Bayesian inference over the space of ensembles
of trees, which tends to work much better. This is known as Bayesian adaptive regression
trees or BART (Chipman et al. 2010). Note that the cost of these sampling-based Bayesian
methods is comparable to the sampling-based random forest method. That is, both approaches
are fairly slow to train, but produce high quality classifiers.

Unfortunately, methods that use multiple trees (whether derived from a Bayesian or frequentist
standpoint) lose their nice interpretability properties. Fortunately, various post-processing
measures can be applied, as discussed in Section 16.8.
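
The bagging and random-forest ideas can be sketched with very simple base learners. The example below is a rough illustration under assumptions: it uses depth-1 regression trees (“stumps”) instead of full trees to keep the code short, draws one random feature subset per tree, and uses hypothetical function names and synthetic data. It is not the algorithm of (Breiman 2001a) in full detail.

```python
import numpy as np

def fit_stump(X, y, features):
    """Best single split (squared-error cost) restricted to the given features."""
    best = None
    for j in features:
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            yl, yr = y[left], y[~left]
            c = np.sum((yl - yl.mean()) ** 2) + np.sum((yr - yr.mean()) ** 2)
            if best is None or c < best[0]:
                best = (c, j, t, yl.mean(), yr.mean())
    if best is None:                       # no valid split: always predict the mean
        return (0, np.inf, y.mean(), y.mean())
    return best[1:]

def predict_stump(stump, X):
    j, t, w_left, w_right = stump
    return np.where(X[:, j] <= t, w_left, w_right)

def fit_forest(X, y, M=50, n_features=1, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    forest = []
    for _ in range(M):
        rows = rng.integers(0, N, size=N)                      # bootstrap sample
        feats = rng.choice(D, size=n_features, replace=False)  # random input subset
        forest.append(fit_stump(X[rows], y[rows], feats))
    return forest

def predict_forest(forest, X):
    """Ensemble average of Eq. 16.15."""
    return np.mean([predict_stump(s, X) for s in forest], axis=0)

# Synthetic data (illustrative only): the response depends on feature 0 alone.
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 3))
y = (X[:, 0] > 0.5).astype(float) + 0.1 * rng.standard_normal(200)

forest = fit_forest(X, y, M=50, n_features=2)
mse = np.mean((predict_forest(forest, X) - y) ** 2)
```

Setting `n_features` equal to the number of inputs turns this into plain bagging, since every stump then sees all the variables and only the bootstrap sample differs.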


CART compared to hierarchical mixture of experts * 


An interesting alternative to a decision tree is known as the hierarchical mixture of experts.
Figure 11.7(b) gives an illustration where we have two levels of experts. This can be thought of
as a probabilistic decision tree of depth 2, since we recursively partition the space, and apply
a different expert to each partition. Hastie et al. (2009, p331) write that “The
HME approach is a promising competitor to CART trees”. Some of the advantages include the
following:


• The model can partition the input space using any set of nested linear decision boundaries.
By contrast, standard decision trees are constrained to use axis-parallel splits.

• The model makes predictions by averaging over all experts. By contrast, in a standard
decision tree, predictions are made only based on the model in the corresponding leaf. Since
leaves often contain few training examples, this can result in overfitting.

• Fitting an HME involves solving a smooth continuous optimization problem (usually using
EM), which is likely to be less prone to local optima than the standard greedy discrete
optimization methods used to fit decision trees. For similar reasons, it is computationally
easier to “be Bayesian” about the parameters of an HME (see e.g., (Peng et al. 1996; Bishop
and Svensén 2003)) than about the structure and parameters of a decision tree (see e.g., (Wu
et al. 2007)).


Generalized additive models 


A simple way to create a nonlinear model with multiple inputs is to use a generalized additive
model (Hastie and Tibshirani 1990), which is a model of the form

$$ f(\mathbf{x}) = \alpha + f_1(x_1) + \cdots + f_D(x_D) \qquad (16.16) $$

Here each $f_j$ can be modeled by some scatterplot smoother, and $f(\mathbf{x})$ can be mapped to $p(y | \mathbf{x})$
using a link function, as in a GLM (hence the term generalized additive model).

If we use regression splines (or some other fixed basis function expansion approach) for the
$f_j$, then each $f_j(x_j)$ can be written as $\boldsymbol{\beta}_j^T \boldsymbol{\phi}_j(x_j)$, so the whole model can be written as
$f(\mathbf{x}) = \boldsymbol{\beta}^T \boldsymbol{\phi}(\mathbf{x})$, where $\boldsymbol{\phi}(\mathbf{x}) = [1, \boldsymbol{\phi}_1(x_1), \ldots, \boldsymbol{\phi}_D(x_D)]$. However, it is more common to use
smoothing splines (Section 15.4.6) for the $f_j$. In this case, the objective (in the regression setting)
becomes

$$ J(\alpha, f_1, \ldots, f_D) = \sum_{i=1}^N \left( y_i - \alpha - \sum_{j=1}^D f_j(x_{ij}) \right)^2 + \sum_{j=1}^D \lambda_j \int f_j''(t_j)^2 \, dt_j \qquad (16.17) $$

where $\lambda_j$ is the strength of the regularizer for $f_j$.


Backfitting 


We now discuss how to fit the model using MLE. The constant $\alpha$ is not uniquely identifiable,
since we can always add or subtract constants to any of the $f_j$ functions. The convention is to
assume $\sum_{i=1}^N f_j(x_{ij}) = 0$ for all $j$. In this case, the MLE for $\alpha$ is just $\hat{\alpha} = \frac{1}{N} \sum_{i=1}^N y_i$.

To fit the rest of the model, we can center the responses (by subtracting $\hat{\alpha}$), and then
iteratively update each $f_j$ in turn, using as a target vector the residuals obtained by omitting
term $f_j$:

$$ \hat{f}_j := \mathrm{smoother}\left( \left\{ y_i - \sum_{k \ne j} \hat{f}_k(x_{ik}) \right\}_{i=1}^N \right) \qquad (16.18) $$

We should then ensure the output is zero mean using

$$ \hat{f}_j := \hat{f}_j - \frac{1}{N} \sum_{i=1}^N \hat{f}_j(x_{ij}) \qquad (16.19) $$


This is called the backfitting algorithm (Hastie and Tibshirani 1990). If X has full column rank, 
then the above objective is convex (since each smoothing spline is a linear operator, as shown 
in Section 15.4.2), so this procedure is guaranteed to converge to the global optimum. 
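
The backfitting loop of Equations 16.18 and 16.19 can be sketched in a few lines. This is only an illustration under stated assumptions: the names `poly_smoother` and `backfit` are hypothetical, and a cubic polynomial least-squares fit stands in for the smoothing spline (it is also a linear smoother, but it is not the smoother used in the text).

```python
import numpy as np

def poly_smoother(x, target, degree=3):
    """Stand-in smoother (an assumption): least-squares polynomial fit of target on x."""
    coeffs = np.polyfit(x, target, degree)
    return np.polyval(coeffs, x)

def backfit(X, y, n_iters=20, smoother=poly_smoother):
    """Fit f(x) = alpha + sum_j f_j(x_j) by backfitting.

    Returns alpha and an N x D matrix F with F[i, j] = f_j(x_ij).
    """
    N, D = X.shape
    alpha = y.mean()                      # MLE of the offset
    F = np.zeros((N, D))
    for _ in range(n_iters):
        for j in range(D):
            # Partial residual: remove the offset and every term except f_j (Eq. 16.18).
            partial = y - alpha - F.sum(axis=1) + F[:, j]
            F[:, j] = smoother(X[:, j], partial)
            F[:, j] -= F[:, j].mean()     # re-center so f_j sums to zero (Eq. 16.19)
    return alpha, F

# Noise-free, exactly additive toy data (chosen for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = 1.0 + X[:, 0] ** 3 + 2.0 * X[:, 1] ** 2
alpha, F = backfit(X, y)
fitted = alpha + F.sum(axis=1)
print("max abs residual:", np.max(np.abs(y - fitted)))
```

Any other smoother with the same `(x, target) -> fitted values` interface could be passed in; the centering step is what keeps the offset identifiable.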

In the GLM case, we need to modify the method somewhat. The basic idea is to replace the 
weighted least squares step of IRLS (see Section 8.3.4) with a weighted backfitting algorithm. In 
the logistic regression case, each response has weight $s_i = \mu_i(1 - \mu_i)$ associated with it, where 
$\mu_i = \text{sigm}(\hat{\alpha} + \sum_{j=1}^D \hat{f}_j(x_{ij}))$. 


Computational efficiency 


Each call to the smoother takes $O(N)$ time, so the total cost is $O(NDT)$, where $T$ is the 
number of iterations. If we have high-dimensional inputs, fitting a GAM is expensive. One 
approach is to combine it with a sparsity penalty, see e.g., the SpAM (sparse additive model) 
approach of (Ravikumar et al. 2009). Alternatively, we can use a greedy approach, such as 
boosting (see Section 16.4.6). 


Multivariate adaptive regression splines (MARS) 


We can extend GAMs by allowing for interaction effects. In general, we can create an ANOVA 
decomposition: 


$$f(\mathbf{x}) = \beta_0 + \sum_{j=1}^D f_j(x_j) + \sum_{j,k} f_{jk}(x_j, x_k) + \sum_{j,k,l} f_{jkl}(x_j, x_k, x_l) + \cdots \tag{16.20}$$


Of course, we cannot allow for too many higher-order interactions, because there will be too 
many parameters to fit. 

It is common to use greedy search to decide which variables to add. The multivariate 
adaptive regression splines or MARS algorithm is one example of this (Hastie et al. 2009, 
Sec 9.4). It fits models of the form in Equation 16.20, where it uses a tensor product basis of 
regression splines to represent the multidimensional regression functions. For example, for 2d 
input, we might use 

$$f(x_1, x_2) = \beta_0 + \sum_m \beta_{1m} (x_1 - t_{1m})_+ + \sum_m \beta_{2m} (t_{2m} - x_2)_+ + \sum_m \beta_{12m} (x_1 - t_{1m})_+ (t_{2m} - x_2)_+ \tag{16.21}$$

To create such a function, we start with a set of candidate basis functions of the form 

$$\mathcal{C} = \{ (x_j - t)_+, (t - x_j)_+ : t \in \{x_{1j}, \ldots, x_{Nj}\}, \; j = 1, \ldots, D \} \tag{16.22}$$


These are 1d linear splines where the knots are at all the observed values for that variable. We 
consider splines sloping up in both directions; this is called a reflecting pair. See Figure 16.7(a). 

Let $\mathcal{M}$ represent the current set of basis functions. We initialize by using $\mathcal{M} = \{1\}$. We 
consider creating a new basis function pair by multiplying an $h_m \in \mathcal{M}$ with one of the reflecting 
pairs in $\mathcal{C}$. For example, we might initially get 

$$f(\mathbf{x}) = 25 - 4(x_1 - 5)_+ + 20(5 - x_1)_+ \tag{16.23}$$

obtained by multiplying $h_0(\mathbf{x}) = 1$ with a reflecting pair involving $x_1$ with knot $t = 5$. This 
pair is added to $\mathcal{M}$. See Figure 16.7(b). At the next step, we might create a model such as 

$$f(\mathbf{x}) = 2 - 2(x_1 - 5)_+ + 3(5 - x_1)_+ - (x_2 - 10)_+ \times (5 - x_1)_+ - 1.2(10 - x_2)_+ \times (5 - x_1)_+ \tag{16.24}$$

obtained by multiplying $(5 - x_1)_+$ from $\mathcal{M}$ by the new reflecting pair $(x_2 - 10)_+$ and $(10 - x_2)_+$. 
This new function is shown in Figure 16.7(c). 
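
To make the notation concrete, the two example models can be evaluated directly from hinge functions. The sketch below is illustrative only: the helper names (`hinge`, `f_1d`, `f_2d`, `candidate_knots`) are hypothetical, and it evaluates the given models rather than running the greedy MARS search.

```python
import numpy as np

def hinge(u):
    """Positive part (u)_+ = max(u, 0)."""
    return np.maximum(u, 0.0)

def f_1d(x1):
    """The 1d model of Equation 16.23."""
    return 25 - 4 * hinge(x1 - 5) + 20 * hinge(5 - x1)

def f_2d(x1, x2):
    """The 2d model of Equation 16.24: two interaction terms share the factor (5 - x1)_+."""
    return (2 - 2 * hinge(x1 - 5) + 3 * hinge(5 - x1)
            - hinge(x2 - 10) * hinge(5 - x1)
            - 1.2 * hinge(10 - x2) * hinge(5 - x1))

def candidate_knots(X):
    """Knots of the candidate set C (Eq. 16.22): every observed value of every variable."""
    return {j: np.unique(X[:, j]) for j in range(X.shape[1])}

print(f_1d(np.array([3.0, 5.0, 7.0])))
print(f_2d(3.0, 8.0), f_2d(3.0, 12.0), f_2d(6.0, 8.0))
```

Note that for $x_1 \geq 5$ the interaction terms vanish, so the 2d model depends on $x_2$ only on the side of the knot where $(5 - x_1)_+$ is active.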
Figure 16.7 (a) Linear spline function with a knot at 5. Solid blue: $(x - 5)_+$. Dotted red: $(5 - x)_+$. (b) A 
MARS model in 1d given by Equation 16.23. (c) A simple MARS model in 2d given by Equation 16.24. Figure 
generated by marsDemo. 


We proceed in this way until the model becomes very large. (We may impose an upper 
bound on the order of interactions.) Then we prune backwards, at each step eliminating the 
basis function that causes the smallest increase in the residual error, until the CV error stops 
improving. 

The whole procedure is closely related to CART. To see this, suppose we replace the piecewise 
linear basis functions by step functions $\mathbb{I}(x_j > t)$ and $\mathbb{I}(x_j < t)$. Multiplying by a pair of 
reflected step functions is equivalent to splitting a node. Now suppose we impose the constraint 
that once a variable is involved in a multiplication by a candidate term, that variable gets 
replaced by the interaction, so the original variable is no longer available. This ensures that a 
variable cannot be split more than once, thus guaranteeing that the resulting model can be 
represented as a tree. In this case, the MARS growing strategy is the same as the CART growing 
strategy. 


Boosting 


Boosting (Schapire and Freund 2012) is a greedy algorithm for fitting adaptive basis-function 
models of the form in Equation 16.3, where the $\phi_m$ are generated by an algorithm called a weak 
learner or a base learner. The algorithm works by applying the weak learner sequentially to 
weighted versions of the data, where more weight is given to examples that were misclassified 
by earlier rounds. 

This weak learner can be any classification or regression algorithm, but it is common to use a 
CART model. In 1998, the late Leo Breiman called boosting, where the weak learner is a shallow 
decision tree, the “best off-the-shelf classifier in the world” (Hastie et al. 2009, p340). This 
is supported by an extensive empirical comparison of 10 different classifiers in (Caruana and 
Niculescu-Mizil 2006), who showed that boosted decision trees were the best both in terms of 
misclassification error and in terms of producing well-calibrated probabilities, as judged by ROC 
curves. (The second best method was random forests, invented by Breiman; see Section 16.2.5.) 
By contrast, single decision trees performed very poorly. 

Boosting was originally derived in the computational learning theory literature (Schapire 1990; 
Freund and Schapire 1996), where the focus is binary classification. In these papers, it was 
proved that one could boost the performance (on the training set) of any weak learner arbitrarily 
high, provided the weak learner could always perform slightly better than chance. For example, 
in Figure 16.8, we plot the training and test error for boosted decision stumps on a 2d dataset 
shown in Figure 16.10. We see that the training set error rapidly goes to near zero. What is more 
surprising is that the test set error continues to decline even after the training set error has 
reached zero (although the test set error will eventually go up). Thus boosting is very resistant 
to overfitting. (Boosted decision stumps form the basis of a very successful face detector (Viola 
and Jones 2001), which was used to generate the results in Figure 1.6, and which is used in many 
digital cameras.) 

Figure 16.8 Performance of AdaBoost using a decision stump as a weak learner on the data in Figure 16.10. 
Training (solid blue) and test (dotted red) error vs number of iterations. Figure generated by boostingDemo, 
written by Richard Stapenhurst. 

In view of its stunning empirical success, statisticians started to become interested in this 
method. Breiman (Breiman 1998) showed that boosting can be interpreted as a form of gradient 
descent in function space. This view was then extended in (Friedman et al. 2000), who showed 
how boosting could be extended to handle a variety of loss functions, including for regression, 
robust regression, Poisson regression, etc. In this section, we shall present this statistical 
interpretation of boosting, drawing on the reviews in (Buhlmann and Hothorn 2007) and (Hastie et al. 
2009, ch10), which should be consulted for further details. 


Forward stagewise additive modeling 


The goal of boosting is to solve the following optimization problem: 

$$\min_f \sum_{i=1}^N L(y_i, f(\mathbf{x}_i)) \tag{16.25}$$

where $L(y, \hat{y})$ is some loss function, and $f$ is assumed to be an ABM model as in Equation 16.3. 
Common choices for the loss function are listed in Table 16.1. 

If we use squared error loss, the optimal estimate is given by 

$$f^*(\mathbf{x}) = \arg\min_{f(\mathbf{x})} \mathbb{E}_{y|\mathbf{x}}\left[ (Y - f(\mathbf{x}))^2 \right] = \mathbb{E}[Y|\mathbf{x}] \tag{16.26}$$
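Name | Loss | Derivative | $f^*$ | Algorithm 
-----|------|------------|-------|---------- 
Squared error | $\frac{1}{2}(y_i - f(\mathbf{x}_i))^2$ | $y_i - f(\mathbf{x}_i)$ | $\mathbb{E}[y \mid \mathbf{x}_i]$ | L2Boosting 
Absolute error | $\lvert y_i - f(\mathbf{x}_i) \rvert$ | $\text{sgn}(y_i - f(\mathbf{x}_i))$ | $\text{median}(y \mid \mathbf{x}_i)$ | Gradient boosting 
Exponential loss | $\exp(-\tilde{y}_i f(\mathbf{x}_i))$ | $-\tilde{y}_i \exp(-\tilde{y}_i f(\mathbf{x}_i))$ | $\frac{1}{2} \log \frac{\pi_i}{1 - \pi_i}$ | AdaBoost 
Logloss | $\log(1 + e^{-\tilde{y}_i f_i})$ | $y_i - \pi_i$ | $\frac{1}{2} \log \frac{\pi_i}{1 - \pi_i}$ | LogitBoost 


Table 16.1 Some commonly used loss functions, their gradients, their population minimizers $f^*$, and 
some algorithms to minimize the loss. For binary classification problems, we assume $\tilde{y}_i \in \{-1, +1\}$, 
$y_i \in \{0, 1\}$ and $\pi_i = \text{sigm}(2 f(\mathbf{x}_i))$. For regression problems, we assume $y_i \in \mathbb{R}$. Adapted from (Hastie 
et al. 2009, p360) and (Buhlmann and Hothorn 2007, p483). 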


Figure 16.9 Illustration of various loss functions for binary classification (0-1 loss, logloss and 
exponential loss). The horizontal axis is the margin $y\eta$, the vertical axis is the loss. The log loss 
uses log base 2. Figure generated by hingeLossPlot. 


as we showed in Section 5.7.1.3. Of course, this cannot be computed in practice since it requires 
knowing the true conditional distribution p(y|x). Hence this is sometimes called the population 
minimizer, where the expectation is interpreted in a frequentist sense. Below we will see that 
boosting will try to approximate this conditional expectation. 

For binary classification, the obvious loss is 0-1 loss, but this is not differentiable. Instead 
it is common to use logloss, which is a convex upper bound on 0-1 loss, as we showed in 
Section 6.5.5. In this case, one can show that the optimal estimate is given by 

$$f^*(\mathbf{x}) = \frac{1}{2} \log \frac{p(\tilde{y} = 1|\mathbf{x})}{p(\tilde{y} = -1|\mathbf{x})} \tag{16.27}$$

where $\tilde{y} \in \{-1, +1\}$. One can generalize this framework to the multiclass case, but we will not 
discuss that here. 

An alternative convex upper bound is exponential loss, defined by 

$$L(\tilde{y}, f) = \exp(-\tilde{y} f) \tag{16.28}$$


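See Figure 16.9 for a plot. This will have some computational advantages over the logloss, 
to be discussed below. It turns out that the optimal estimate for this loss is also 
$f^*(\mathbf{x}) = \frac{1}{2} \log \frac{p(\tilde{y} = 1|\mathbf{x})}{p(\tilde{y} = -1|\mathbf{x})}$. To see this, we can just set the derivative of the expected loss (for each $\mathbf{x}$) to 
zero: 

$$\begin{aligned}
\frac{\partial}{\partial f(\mathbf{x})} \mathbb{E}\left[ e^{-\tilde{y} f(\mathbf{x})} \mid \mathbf{x} \right] &= \frac{\partial}{\partial f(\mathbf{x})} \left[ p(\tilde{y} = 1|\mathbf{x}) e^{-f(\mathbf{x})} + p(\tilde{y} = -1|\mathbf{x}) e^{f(\mathbf{x})} \right] && (16.29) \\
&= -p(\tilde{y} = 1|\mathbf{x}) e^{-f(\mathbf{x})} + p(\tilde{y} = -1|\mathbf{x}) e^{f(\mathbf{x})} && (16.30) \\
&= 0 \;\Rightarrow\; \frac{p(\tilde{y} = 1|\mathbf{x})}{p(\tilde{y} = -1|\mathbf{x})} = e^{2 f(\mathbf{x})} && (16.31)
\end{aligned}$$
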


So in both cases, we can see that boosting should try to approximate (half) the log-odds ratio. 
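
As a quick numerical sanity check of the exponential-loss result, one can minimize the expected loss over a grid of candidate values for $f(\mathbf{x})$ at a single input. The value `p = 0.8` for $p(\tilde{y} = 1|\mathbf{x})$ and the grid range are arbitrary assumptions made for illustration.

```python
import numpy as np

p = 0.8                                    # assumed p(y = +1 | x) at one fixed x
f_grid = np.linspace(-3, 3, 60001)         # candidate values of f(x), spacing 1e-4

# Expected exponential loss as a function of f(x), as in Eq. 16.29.
expected_loss = p * np.exp(-f_grid) + (1 - p) * np.exp(f_grid)

f_numeric = f_grid[np.argmin(expected_loss)]
f_closed_form = 0.5 * np.log(p / (1 - p))  # half the log-odds

print(f_numeric, f_closed_form)
```

The two numbers should agree up to the grid resolution.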
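Since finding the optimal $f$ is hard, we shall tackle it sequentially. We initialise by defining 

$$f_0(\mathbf{x}) = \arg\min_{\boldsymbol{\gamma}} \sum_{i=1}^N L(y_i, f(\mathbf{x}_i; \boldsymbol{\gamma})) \tag{16.32}$$

For example, if we use squared error, we can set $f_0(\mathbf{x}) = \bar{y}$, and if we use log-loss or exponential 
loss, we can set $f_0(\mathbf{x}) = \frac{1}{2} \log \frac{\hat{\pi}}{1 - \hat{\pi}}$, where $\hat{\pi} = \frac{1}{N} \sum_{i=1}^N \mathbb{I}(y_i = 1)$. We could also use a more 
powerful model for our baseline, such as a GLM. 

Then at iteration $m$, we compute 

$$(\beta_m, \boldsymbol{\gamma}_m) = \arg\min_{\beta, \boldsymbol{\gamma}} \sum_{i=1}^N L(y_i, f_{m-1}(\mathbf{x}_i) + \beta \phi(\mathbf{x}_i; \boldsymbol{\gamma})) \tag{16.33}$$

and then we set 

$$f_m(\mathbf{x}) = f_{m-1}(\mathbf{x}) + \beta_m \phi(\mathbf{x}; \boldsymbol{\gamma}_m) \tag{16.34}$$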


The key point is that we do not go back and adjust earlier parameters. This is why the method 
is called forward stagewise additive modeling. 

We continue this for a fixed number of iterations M. In fact M is the main tuning parameter 
of the method. Often we pick it by monitoring the performance on a separate validation set, and 
then stopping once performance starts to decrease; this is called early stopping. Alternatively, 
we can use model selection criteria such as AIC or BIC (see e.g., (Buhlmann and Hothorn 2007) 
for details). 

In practice, better (test set) performance can be obtained by performing “partial updates” of 
the form 

$$f_m(\mathbf{x}) = f_{m-1}(\mathbf{x}) + \nu \beta_m \phi(\mathbf{x}; \boldsymbol{\gamma}_m) \tag{16.35}$$

Here $0 < \nu \leq 1$ is a step-size parameter. In practice it is common to use a small value such as 
$\nu = 0.1$. This is called shrinkage. 

Below we discuss how to solve the subproblem in Equation 16.33. This will depend on the 
form of loss function. However, it is independent of the form of weak learner. 


L2boosting 


Suppose we used squared error loss. Then at step $m$ the loss has the form 

$$L(y_i, f_{m-1}(\mathbf{x}_i) + \beta \phi(\mathbf{x}_i; \boldsymbol{\gamma})) = (r_{im} - \phi(\mathbf{x}_i; \boldsymbol{\gamma}))^2 \tag{16.36}$$

where $r_{im} \triangleq y_i - f_{m-1}(\mathbf{x}_i)$ is the current residual, and we have set $\beta = 1$ without loss of 
generality. Hence we can find the new basis function by using the weak learner to predict $\mathbf{r}_m$. 
This is called L2boosting, or least squares boosting (Buhlmann and Yu 2003). In Section 16.4.6, 
we will see that this method, with a suitable choice of weak learner, can be made to give the 
same results as LARS, which can be used to perform variable selection (see Section 13.4.2). 

Figure 16.10 Example of AdaBoost using a decision stump as a weak learner. The degree of blackness 
represents the confidence in the red class. The degree of whiteness represents the confidence in the blue 
class. The size of the datapoints represents their weight. Decision boundary is in yellow. (a) After 1 
round. (b) After 3 rounds. (c) After 120 rounds. Figure generated by boostingDemo, written by Richard 
Stapenhurst. 
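
A minimal sketch of L2boosting with shrinkage is given below, assuming a 1d input and a regression stump as the weak learner. The function names (`fit_stump`, `predict_stump`, `l2boost`) and the toy sine data are illustrative assumptions, not part of the original demos.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares regression stump on a 1d input.

    Returns (threshold, left_mean, right_mean) minimizing the squared error to r.
    """
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    best_sse, best = np.inf, None
    for k in range(1, len(xs)):
        if xs[k] == xs[k - 1]:
            continue                      # no valid threshold between equal inputs
        left, right = rs[:k], rs[k:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse = sse
            best = (0.5 * (xs[k] + xs[k - 1]), left.mean(), right.mean())
    return best

def predict_stump(stump, x):
    t, lo, hi = stump
    return np.where(x <= t, lo, hi)

def l2boost(x, y, M=200, nu=0.1):
    """Forward stagewise fitting with squared error: each stump is fit to the residual."""
    f0 = y.mean()                         # f_0 for squared error
    f = np.full(len(y), f0)
    stumps, train_mse = [], []
    for _ in range(M):
        r = y - f                         # current residual r_im
        stump = fit_stump(x, r)
        f = f + nu * predict_stump(stump, x)   # shrunken update, as in Eq. 16.35
        stumps.append(stump)
        train_mse.append(np.mean((y - f) ** 2))
    return f0, stumps, train_mse

x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x)
f0, stumps, train_mse = l2boost(x, y)
print(train_mse[0], train_mse[-1])
```

Because each stump is a least-squares fit to the residual and $0 < \nu \leq 1$, the training error cannot increase from one round to the next; the number of rounds `M` would in practice be chosen on a validation set.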


AdaBoost 


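Consider a binary classification problem with exponential loss. At step $m$ we have to minimize 

$$L_m(\phi) = \sum_{i=1}^N \exp[-\tilde{y}_i (f_{m-1}(\mathbf{x}_i) + \beta \phi(\mathbf{x}_i))] = \sum_{i=1}^N w_{i,m} \exp(-\beta \tilde{y}_i \phi(\mathbf{x}_i)) \tag{16.37}$$

where $w_{i,m} \triangleq \exp(-\tilde{y}_i f_{m-1}(\mathbf{x}_i))$ is a weight applied to datacase $i$, and $\tilde{y}_i \in \{-1, +1\}$. We 
can rewrite this objective as follows: 

$$\begin{aligned}
L_m &= e^{-\beta} \sum_{\tilde{y}_i = \phi(\mathbf{x}_i)} w_{i,m} + e^{\beta} \sum_{\tilde{y}_i \neq \phi(\mathbf{x}_i)} w_{i,m} && (16.38) \\
&= (e^{\beta} - e^{-\beta}) \sum_{i=1}^N w_{i,m} \mathbb{I}(\tilde{y}_i \neq \phi(\mathbf{x}_i)) + e^{-\beta} \sum_{i=1}^N w_{i,m} && (16.39)
\end{aligned}$$

Consequently the optimal function to add is 

$$\phi_m = \arg\min_{\phi} \sum_{i=1}^N w_{i,m} \mathbb{I}(\tilde{y}_i \neq \phi(\mathbf{x}_i)) \tag{16.40}$$

This can be found by applying the weak learner to a weighted version of the dataset, with 
weights $w_{i,m}$. Substituting $\phi_m$ into $L_m$ and solving for $\beta$ we find 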


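$$\beta_m = \frac{1}{2} \log \frac{1 - \text{err}_m}{\text{err}_m} \tag{16.41}$$

where 

$$\text{err}_m = \frac{\sum_{i=1}^N w_{i,m} \mathbb{I}(\tilde{y}_i \neq \phi_m(\mathbf{x}_i))}{\sum_{i=1}^N w_{i,m}} \tag{16.42}$$

The overall update becomes 

$$f_m(\mathbf{x}) = f_{m-1}(\mathbf{x}) + \beta_m \phi_m(\mathbf{x}) \tag{16.43}$$

With this, the weights at the next iteration become 

$$\begin{aligned}
w_{i,m+1} &= w_{i,m} e^{-\beta_m \tilde{y}_i \phi_m(\mathbf{x}_i)} && (16.44) \\
&= w_{i,m} e^{\beta_m (2 \mathbb{I}(\tilde{y}_i \neq \phi_m(\mathbf{x}_i)) - 1)} && (16.45) \\
&= w_{i,m} e^{2 \beta_m \mathbb{I}(\tilde{y}_i \neq \phi_m(\mathbf{x}_i))} e^{-\beta_m} && (16.46)
\end{aligned}$$

where we exploited the fact that $-\tilde{y}_i \phi_m(\mathbf{x}_i) = -1$ if $\tilde{y}_i = \phi_m(\mathbf{x}_i)$ and $-\tilde{y}_i \phi_m(\mathbf{x}_i) = +1$ 
otherwise. Since $e^{-\beta_m}$ will cancel out in the normalization step, we can drop it. The result is 
the algorithm shown in Algorithm 16.2, known as AdaBoost.M1.⁴ 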

An example of this algorithm in action, using decision stumps as the weak learner, is given in 
Figure 16.10. We see that after many iterations, we can “carve out” a complex decision boundary. 
What is rather surprising is that AdaBoost is very slow to overfit, as is apparent in Figure 16.8. 
See Section 16.4.8 for a discussion of this point. 


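Algorithm 16.2: AdaBoost.M1, for binary classification with exponential loss 
$w_i = 1/N$; 
for $m = 1 : M$ do 
    Fit a classifier $\phi_m(\mathbf{x})$ to the training set using weights $\mathbf{w}$; 
    Compute $\text{err}_m = \frac{\sum_{i=1}^N w_i \mathbb{I}(\tilde{y}_i \neq \phi_m(\mathbf{x}_i))}{\sum_{i=1}^N w_i}$; 
    Compute $\alpha_m = \log[(1 - \text{err}_m) / \text{err}_m]$; 
    Set $w_i \leftarrow w_i \exp[\alpha_m \mathbb{I}(\tilde{y}_i \neq \phi_m(\mathbf{x}_i))]$; 
Return $f(\mathbf{x}) = \text{sgn}\left[ \sum_{m=1}^M \alpha_m \phi_m(\mathbf{x}) \right]$; 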
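
The steps of AdaBoost.M1 translate almost line for line into code. The following is a small sketch under stated assumptions: the weak learner is an exhaustive search over decision stumps, the function names are hypothetical, the error rate is clipped away from 0 and 1 to avoid taking the log of zero, and the weights are renormalized each round (neither of the last two is part of Algorithm 16.2, and renormalizing leaves $\text{err}_m$ unchanged).

```python
import numpy as np

def fit_weighted_stump(X, y, w):
    """Pick the (feature, threshold, sign) stump with the smallest weighted 0-1 error."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] <= t, 1, -1)
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, t, s)
    return best

def stump_predict(stump, X):
    j, t, s = stump
    return s * np.where(X[:, j] <= t, 1, -1)

def adaboost_m1(X, y, M=50):
    """AdaBoost.M1 for labels in {-1, +1}; returns the stumps and their weights alpha_m."""
    N = len(y)
    w = np.full(N, 1.0 / N)
    stumps, alphas = [], []
    for _ in range(M):
        stump = fit_weighted_stump(X, y, w)
        miss = stump_predict(stump, X) != y
        err = w[miss].sum() / w.sum()
        err = np.clip(err, 1e-12, 1 - 1e-12)    # guard against log(0) (an added safeguard)
        alpha = np.log((1 - err) / err)
        w = w * np.exp(alpha * miss)             # upweight the misclassified cases
        w = w / w.sum()                          # renormalize for numerical stability
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    score = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(score)

# Toy 1d problem: the positive class is an interval, so no single stump can separate it.
x = np.linspace(0, 1, 100)
X = x[:, None]
y = np.where((x > 0.3) & (x < 0.7), 1, -1)
stumps, alphas = adaboost_m1(X, y, M=3)
train_err = np.mean(adaboost_predict(stumps, alphas, X) != y)
first_stump_err = np.mean(stump_predict(stumps[0], X) != y)
print(first_stump_err, train_err)
```

On this toy problem the best single stump misclassifies 30% of the points, whereas the weighted vote of three stumps separates the interval from its two flanks.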


LogitBoost 


The trouble with exponential loss is that it puts a lot of weight on misclassified examples, as 
is apparent from the exponential blowup on the left hand side of Figure 16.9. This makes the 
method very sensitive to outliers (mislabeled examples). In addition, $e^{-\tilde{y} f}$ is not the logarithm 
of any pmf for binary variables $\tilde{y} \in \{-1, +1\}$; consequently we cannot recover probability 
estimates from $f(\mathbf{x})$. 


4. In (Friedman et al. 2000), this is called discrete AdaBoost, since it assumes that the base classifier $\phi_m$ returns a 
binary class label. If $\phi_m$ returns a probability instead, a modified algorithm, known as real AdaBoost, can be used. See 
(Friedman et al. 2000) for details. 




A natural alternative is to use logloss instead. This only punishes mistakes linearly, as is clear 
from Figure 16.9. Furthermore, it means that we will be able to extract probabilities from the 
final learned function, using 

$$p(y = 1|\mathbf{x}) = \frac{e^{f(\mathbf{x})}}{e^{-f(\mathbf{x})} + e^{f(\mathbf{x})}} = \frac{1}{1 + e^{-2 f(\mathbf{x})}} \tag{16.47}$$

The goal is to minimize the expected log-loss, given by 

$$L_m(\phi) = \sum_{i=1}^N \log\left[ 1 + \exp\left( -2 \tilde{y}_i (f_{m-1}(\mathbf{x}_i) + \phi(\mathbf{x}_i)) \right) \right] \tag{16.48}$$

By performing a Newton update on this objective (similar to IRLS), one can derive the algorithm 
shown in Algorithm 16.3. This is known as LogitBoost (Friedman et al. 2000). It can be generalized 
to the multi-class setting, as explained in (Friedman et al. 2000). 


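Algorithm 16.3: LogitBoost, for binary classification with log-loss 
$w_i = 1/N$, $\pi_i = 1/2$; 
for $m = 1 : M$ do 
    Compute the working response $z_i = \frac{y_i - \pi_i}{\pi_i (1 - \pi_i)}$, where $y_i \in \{0, 1\}$; 
    Compute the weights $w_i = \pi_i (1 - \pi_i)$; 
    $\phi_m = \arg\min_{\phi} \sum_{i=1}^N w_i (z_i - \phi(\mathbf{x}_i))^2$; 
    Update $f(\mathbf{x}) \leftarrow f(\mathbf{x}) + \frac{1}{2} \phi_m(\mathbf{x})$; 
    Compute $\pi_i = 1 / (1 + \exp(-2 f(\mathbf{x}_i)))$; 
Return $f(\mathbf{x}) = \text{sgn}\left[ \sum_{m=1}^M \phi_m(\mathbf{x}) \right]$; 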


16.4.5 Boosting as functional gradient descent 


Rather than deriving new versions of boosting for every different loss function, it is possible to 
derive a generic version, known as gradient boosting (Friedman 2001; Mason et al. 2000). To 
explain this, imagine minimizing 


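$$\hat{\mathbf{f}} = \arg\min_{\mathbf{f}} L(\mathbf{f}) \tag{16.49}$$

where $\mathbf{f} = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_N))$ are the “parameters”. We will solve this stagewise, using gradient 
descent. At step $m$, let $\mathbf{g}_m$ be the gradient of $L(\mathbf{f})$ evaluated at $\mathbf{f} = \mathbf{f}_{m-1}$: 

$$g_{im} = \left[ \frac{\partial L(y_i, f(\mathbf{x}_i))}{\partial f(\mathbf{x}_i)} \right]_{f = f_{m-1}} \tag{16.50}$$

Gradients of some common loss functions are given in Table 16.1. We then make the update 

$$\mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m \mathbf{g}_m \tag{16.51}$$

where $\rho_m$ is the step length, chosen by 

$$\rho_m = \arg\min_{\rho} L(\mathbf{f}_{m-1} - \rho \mathbf{g}_m) \tag{16.52}$$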


This is called functional gradient descent. 

In its current form, this is not much use, since it only optimizes f at a fixed set of N points, 
so we do not learn a function that can generalize. However, we can modify the algorithm by 
fitting a weak learner to approximate the negative gradient signal. That is, we use this update 


$$\boldsymbol{\gamma}_m = \arg\min_{\boldsymbol{\gamma}} \sum_{i=1}^N (-g_{im} - \phi(\mathbf{x}_i; \boldsymbol{\gamma}))^2 \tag{16.53}$$

The overall algorithm is summarized in Algorithm 16.4. (We have omitted the line search step, 
which is not strictly necessary, as argued in (Buhlmann and Hothorn 2007).) 


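Algorithm 16.4: Gradient boosting 
Initialize $f_0(\mathbf{x}) = \arg\min_{\boldsymbol{\gamma}} \sum_{i=1}^N L(y_i, \phi(\mathbf{x}_i; \boldsymbol{\gamma}))$; 
for $m = 1 : M$ do 
    Compute the gradient residual using $r_{im} = -\left[ \frac{\partial L(y_i, f(\mathbf{x}_i))}{\partial f(\mathbf{x}_i)} \right]_{f(\mathbf{x}_i) = f_{m-1}(\mathbf{x}_i)}$; 
    Use the weak learner to compute $\boldsymbol{\gamma}_m$ which minimizes $\sum_{i=1}^N (r_{im} - \phi(\mathbf{x}_i; \boldsymbol{\gamma}_m))^2$; 
    Update $f_m(\mathbf{x}) = f_{m-1}(\mathbf{x}) + \nu \phi(\mathbf{x}; \boldsymbol{\gamma}_m)$; 
Return $f(\mathbf{x}) = f_M(\mathbf{x})$ 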


If we apply this algorithm using squared loss, we recover L2Boosting. If we apply this 
algorithm to log-loss, we get an algorithm known as BinomialBoost (Buhlmann and Hothorn 
2007). The advantage of this over LogitBoost is that it does not need to be able to do weighted 
fitting: it just applies any black-box regression model to the gradient vector. Also, it is relatively 
easy to extend to the multi-class case (see (Hastie et al. 2009, p387)). We can also apply this 
algorithm to other loss functions, such as the Huber loss (Section 7.4), which is more robust to 
outliers than squared error loss. 
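
The point that only the gradient changes from one loss to another can be illustrated with a generic sketch of the gradient boosting loop. Everything below is an assumption made for illustration: the names `gradient_boost`, `neg_grad_squared` and `neg_grad_absolute` are hypothetical, the weak learner is an unweighted regression stump on a 1d input, and there is no line search.

```python
import numpy as np

def fit_stump(x, r):
    """Unweighted least-squares stump on a 1d input: (threshold, left_value, right_value)."""
    best_sse, best = np.inf, None
    for t in np.unique(x)[:-1]:           # skip the largest value so both sides are non-empty
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (t, left.mean(), right.mean())
    return best

def predict_stump(stump, x):
    t, lo, hi = stump
    return np.where(x <= t, lo, hi)

# Negative gradients for two of the losses in Table 16.1.
def neg_grad_squared(y, f):
    return y - f

def neg_grad_absolute(y, f):
    return np.sign(y - f)

def gradient_boost(x, y, neg_grad, f0, M=300, nu=0.1):
    """Generic gradient boosting: fit the weak learner to the negative gradient each round."""
    f = np.full(len(y), float(f0))
    stumps = []
    for _ in range(M):
        r = neg_grad(y, f)                # gradient residual r_im
        stump = fit_stump(x, r)           # plain regression fit, no weights needed
        f = f + nu * predict_stump(stump, x)
        stumps.append(stump)
    return stumps, f

x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x)

# Squared error: start from the mean. Absolute error: start from the median.
_, f_sq = gradient_boost(x, y, neg_grad_squared, f0=y.mean())
_, f_abs = gradient_boost(x, y, neg_grad_absolute, f0=np.median(y))

mse_init, mse_final = np.mean((y - y.mean()) ** 2), np.mean((y - f_sq) ** 2)
mae_init, mae_final = np.mean(np.abs(y - np.median(y))), np.mean(np.abs(y - f_abs))
print(mse_init, mse_final, mae_init, mae_final)
```

Swapping in a different loss only requires a new negative-gradient function and a suitable $f_0$; the weak learner is treated as a black box.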


Sparse boosting 


Suppose we use as our weak learner the following algorithm: search over all possible variables 
$j = 1 : D$, and pick the one $j(m)$ that best predicts the residual vector: 

$$\begin{aligned}
j(m) &= \arg\min_j \sum_{i=1}^N (r_{im} - \hat{\beta}_{jm} x_{ij})^2 && (16.54) \\
\hat{\beta}_{jm} &= \frac{\sum_{i=1}^N x_{ij} r_{im}}{\sum_{i=1}^N x_{ij}^2} && (16.55) \\
\phi_m(\mathbf{x}) &= \hat{\beta}_{j(m),m} x_{j(m)} && (16.56)
\end{aligned}$$

This method, which is known as sparse boosting (Buhlmann and Yu 2006), is identical to the 
matching pursuit algorithm discussed in Section 13.2.3.1. 

It is clear that this will result in a sparse estimate, at least if M is small. To see this, let us 
rewrite the update as follows: 


$$\boldsymbol{\beta}_m := \boldsymbol{\beta}_{m-1} + \nu (0, \ldots, 0, \hat{\beta}_{j(m),m}, 0, \ldots, 0) \tag{16.57}$$

where the non-zero entry occurs in location $j(m)$. This is known as forward stagewise linear 
regression (Hastie et al. 2009, p608), which becomes equivalent to the LAR algorithm discussed 
in Section 13.4.2 as $\nu \to 0$. Increasing the number of steps $m$ in boosting is analogous to 
decreasing the regularization penalty $\lambda$. If we modify boosting to allow some variable deletion 
steps (Zhao and Yu 2007), we can make it equivalent to the LARS algorithm, which computes 
the full regularization path for the lasso problem. The same algorithm can be used for sparse 
logistic regression, by simply modifying the residual to be the appropriate negative gradient. 
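
The componentwise update of Equations 16.54 to 16.57 can be sketched as follows, for the squared-error case only. The function name `sparse_l2boost` and the synthetic data are assumptions made for illustration; the inputs and responses are assumed to be centered so that no offset term is needed.

```python
import numpy as np

def sparse_l2boost(X, y, M=500, nu=0.1):
    """Forward stagewise linear regression; assumes centered columns of X and centered y."""
    N, D = X.shape
    beta = np.zeros(D)
    r = y.astype(float).copy()                    # residual of the all-zero model
    col_sq = (X ** 2).sum(axis=0)                 # sum_i x_ij^2 for each j
    for _ in range(M):
        beta_hat = X.T @ r / col_sq               # Eq. 16.55 for every j at once
        sse = ((r[:, None] - X * beta_hat) ** 2).sum(axis=0)
        j = np.argmin(sse)                        # Eq. 16.54: best single variable
        beta[j] += nu * beta_hat[j]               # Eq. 16.57: update one coordinate
        r = r - nu * beta_hat[j] * X[:, j]        # new residual
    return beta

# Synthetic noise-free data in which only variables 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X = X - X.mean(axis=0)
beta_true = np.zeros(10)
beta_true[[0, 3]] = [3.0, -2.0]
y = X @ beta_true

beta_early = sparse_l2boost(X, y, M=20)           # few rounds: few non-zero coefficients
beta_late = sparse_l2boost(X, y, M=500)           # many rounds: approaches least squares
print(np.flatnonzero(beta_early), np.round(beta_late, 3))
```

Stopping early plays the role of a strong sparsity penalty, while running for many rounds moves the estimate towards the unregularized least squares solution.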

Now consider a weak learner that is similar to the above, except it uses a smoothing spline 
instead of linear regression when mapping from $x_j$ to the residual. The result is a sparse 
generalized additive model (see Section 16.3). It can obviously be extended to pick pairs of 
variables at a time. The resulting method often works much better than MARS (Buhlmann and 
Yu 2006). 


Multivariate adaptive regression trees (MART) 


It is quite common to use CART models as weak learners. It is usually advisable to use a shallow 
tree, so that the variance is low. Even though the bias will be high (since a shallow tree is likely 
to be far from the “truth”), this will be compensated for in subsequent rounds of boosting. 

The height of the tree is an additional tuning parameter (in addition to $M$, the number of 
rounds of boosting, and $\nu$, the shrinkage factor). Suppose we restrict to trees with $J$ leaves. 
If $J = 2$, we get a stump, which can only split on a single variable. If $J = 3$, we allow for 
two-variable interactions, etc. In general, it is recommended (e.g., in (Hastie et al. 2009, p363) 
and (Caruana and Niculescu-Mizil 2006)) to use $J \approx 6$. 

If we combine the gradient boosting algorithm with (shallow) regression trees, we get a model 
known as MART, which stands for “multivariate adaptive regression trees”. This actually includes 
a slight refinement to the basic gradient boosting algorithm: after fitting a regression tree to the 
residual (negative gradient), we re-estimate the parameters at the leaves of the tree to minimize 
the loss: 


$$\gamma_{jm} = \arg\min_{\gamma} \sum_{\mathbf{x}_i \in R_{jm}} L(y_i, f_{m-1}(\mathbf{x}_i) + \gamma) \tag{16.58}$$

where $R_{jm}$ is the region for leaf $j$ in the $m$’th tree, and $\gamma_{jm}$ is the corresponding parameter (the 
mean response of $y$ for regression problems, or the most probable class label for classification 
problems). 


Why does boosting work so well? 


We have seen that boosting works very well, especially for classifiers. There are two main 
reasons for this. First, it can be seen as a form of $\ell_1$ regularization, which is known to help 
prevent overfitting by eliminating “irrelevant” features. To see this, imagine pre-computing all 
possible weak-learners, and defining a feature vector of the form $\boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}), \ldots, \phi_K(\mathbf{x})]$. 
We could use $\ell_1$ regularization to select a subset of these. Alternatively we can use boosting, 
where at each step, the weak learner creates a new $\phi_k$ on the fly. It is possible to combine 
boosting and $\ell_1$ regularization, to get an algorithm known as L1-Adaboost (Duchi and Singer 
2009). Essentially this method greedily adds the best features (weak learners) using boosting, 
and then prunes off irrelevant ones using $\ell_1$ regularization. 

Another explanation has to do with the concept of margin, which we introduced in 
Section 14.5.2.2. (Schapire et al. 1998; Ratsch et al. 2001) proved that AdaBoost maximizes the 
margin on the training data. (Rosset et al. 2004) generalized this to other loss functions, such as 
log-loss. 


A Bayesian view 


So far, our presentation of boosting has been very frequentist, since it has focussed on greedily 
minimizing loss functions. A likelihood interpretation of the algorithm was given in (Neal and 
MacKay 1998; Meek et al. 2002). The idea is to consider a mixture of experts model of the form 


$$p(y|\mathbf{x}, \boldsymbol{\theta}) = \sum_{m=1}^M \pi_m \, p(y|\mathbf{x}, \boldsymbol{\gamma}_m) \tag{16.59}$$

where each expert $p(y|\mathbf{x}, \boldsymbol{\gamma}_m)$ is like a weak learner. We usually fit all $M$ experts at once 
using EM, but we can imagine a sequential scheme, whereby we only update the parameters 
for one expert at a time. In the E step, the posterior responsibilities will reflect how well the 
existing experts explain a given data point; if this is a poor fit, these data points will have 
more influence on the next expert that is fitted. (This view naturally suggests a way to use a 
boosting-like algorithm for unsupervised learning: we simply sequentially fit mixture models, 
instead of mixtures of experts.) 

Notice that this is a rather “broken” MLE procedure, since it never goes back to update the 
parameters of an old expert. Similarly, if boosting ever wants to change the weight assigned to a 
weak learner, the only way to do this is to add the weak learner again with a new weight. This 
can result in unnecessarily large models. By contrast, the BART model (Chipman et al. 2006, 
2010) uses a Bayesian version of backfitting to fit a small sum of weak learners (typically trees). 


Feedforward neural networks (multilayer perceptrons) 


A feedforward neural network, aka multi-layer perceptron (MLP), is a series of logistic 
regression models stacked on top of each other, with the final layer being either another logistic 
regression or a linear regression model, depending on whether we are solving a classification or 
regression problem. For example, if we have two layers, and we are solving a regression problem, 
the model has the form 


$$\begin{aligned}
p(y|\mathbf{x}, \boldsymbol{\theta}) &= \mathcal{N}(y|\mathbf{w}^T \mathbf{z}(\mathbf{x}), \sigma^2) && (16.60) \\
\mathbf{z}(\mathbf{x}) &= g(\mathbf{V}\mathbf{x}) = [g(\mathbf{v}_1^T \mathbf{x}), \ldots, g(\mathbf{v}_H^T \mathbf{x})] && (16.61)
\end{aligned}$$

where $g$ is a non-linear activation or transfer function (commonly the logistic function), 
$\mathbf{z}(\mathbf{x}) = \boldsymbol{\phi}(\mathbf{x}, \mathbf{V})$ is called the hidden layer (a deterministic function of the input), $H$ is the 
number of hidden units, $\mathbf{V}$ is the weight matrix from the inputs to the hidden nodes, and 
$\mathbf{w}$ is the weight vector from the hidden nodes to the output. It is important that $g$ be 
non-linear, otherwise the whole model collapses into a large linear regression model of the form 
$y = \mathbf{w}^T (\mathbf{V}\mathbf{x})$. One can show that an MLP is a universal approximator, meaning it can model 
any suitably smooth function, given enough hidden units, to any desired level of accuracy 
(Hornik 1991). 

Figure 16.11 A neural network with one hidden layer. 

To handle binary classification, we pass the output through a sigmoid, as in a GLM: 


$$p(y|\mathbf{x}, \boldsymbol{\theta}) = \text{Ber}(y|\text{sigm}(\mathbf{w}^T \mathbf{z}(\mathbf{x}))) \tag{16.62}$$


We can easily extend the MLP to predict multiple outputs. For example, in the regression case, 
we have 


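$$p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{y}|\mathbf{W} \boldsymbol{\phi}(\mathbf{x}, \mathbf{V}), \sigma^2 \mathbf{I}) \tag{16.63}$$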


See Figure 16.11 for an illustration. If we add mutual inhibition arcs between the output units, 
ensuring that only one of them turns on, we can enforce a sum-to-one constraint, which can be 
used for multi-class classification. The resulting model has the form 


$$p(y|\mathbf{x}, \boldsymbol{\theta}) = \text{Cat}(y|\mathcal{S}(\mathbf{W} \mathbf{z}(\mathbf{x}))) \tag{16.64}$$
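
The forward pass of this one-hidden-layer model is short enough to write out. The sketch below is illustrative: the function names and the randomly drawn weights are assumptions, bias terms are omitted as in the equations above, and the logistic function is used for $g$. It also checks numerically that replacing $g$ by the identity collapses the model to a single linear map.

```python
import numpy as np

def sigm(a):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Softmax function S(.), shifted by the max for numerical stability."""
    e = np.exp(a - a.max())
    return e / e.sum()

def mlp_forward(x, V, W, task="regression"):
    """One hidden layer: z(x) = g(Vx), followed by a linear or GLM-style output."""
    z = sigm(V @ x)                 # hidden layer, H units
    a = W @ z                       # output pre-activations
    if task == "regression":
        return a                    # mean of the Gaussian (Eq. 16.60 / 16.63)
    if task == "binary":
        return sigm(a)              # Bernoulli parameter (Eq. 16.62)
    if task == "multiclass":
        return softmax(a)           # parameters of the categorical (Eq. 16.64)
    raise ValueError(f"unknown task: {task}")

rng = np.random.default_rng(0)
D, H, C = 4, 3, 5                   # input size, hidden units, number of classes
x = rng.normal(size=D)
V = rng.normal(size=(H, D))         # input-to-hidden weights
w = rng.normal(size=H)              # hidden-to-output weights, scalar output
W = rng.normal(size=(C, H))         # hidden-to-output weights, C outputs

y_mean = mlp_forward(x, V, w, "regression")
p_binary = mlp_forward(x, V, w, "binary")
p_classes = mlp_forward(x, V, W, "multiclass")

# With an identity activation the two layers reduce to the single matrix W V.
linear_two_layer = W @ (V @ x)
linear_one_layer = (W @ V) @ x
print(y_mean, p_binary, p_classes.sum())
```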


Convolutional neural networks 


The purpose of the hidden units is to learn non-linear combinations of the original inputs; this 
is called feature extraction or feature construction. These hidden features are then passed as 
input to the final GLM. This approach is particularly useful for problems where the original input 
features are not very individually informative. For example, each pixel in an image is not very 
informative; it is the combination of pixels that tells us what objects are present. Conversely, for 
a task such as document classification using a bag of words representation, each feature (word 
count) is informative on its own, so extracting “higher order” features is less important. Not 
surprisingly, then, much of the work in neural networks has been motivated by visual pattern 
recognition (e.g., (LeCun et al. 1989)), although they have also been applied to other types of 
data, including text (e.g., (Collobert and Weston 2008)). 

Figure 16.12 The convolutional neural network from (Simard et al. 2003). Source: 
http://www.codeproject.com/KB/library/NeuralNetRecognition.aspx. Used with kind permission of Mike O'Neill. 

A form of MLP which is particularly well suited to 1d signals like speech or text, or 2d signals 
like images, is the convolutional neural network. This is an MLP in which the hidden units 
have local receptive fields (as in the primary visual cortex), and in which the weights are tied 
or shared across the image, in order to reduce the number of parameters. Intuitively, the effect 
of such spatial parameter tying is that any useful features that are “discovered” in some portion 
of the image can be re-used everywhere else without having to be independently learned. The 
resulting network then exhibits translation invariance, meaning it can classify patterns no 
matter where they occur inside the input image. 

Figure 16.12 gives an example of a convolutional network, designed by Simard and colleagues 
(Simard et al. 2003), with 5 layers (4 layers of adjustable parameters) designed to classify 29 x 29 
gray-scale images of handwritten digits from the MNIST dataset (see Section 1.2.1.3). In layer 1, 
we have 6 feature maps each of which has size 13 x 13. Each hidden node in one of these 
feature maps is computed by convolving the image with a 5 x 5 weight matrix (sometimes called 
a kernel), adding a bias, and then passing the result through some form of nonlinearity. There 
are therefore 13 x 13 x 6 = 1014 neurons in Layer 1, and (5 x 5 + 1) x 6 = 156 weights. (The 
"+1" is for the bias.) If we did not share these parameters, there would be 1014 x 26 = 26,364 
weights at the first layer. In layer 2, we have 50 feature maps, each of which is obtained by 
convolving each feature map in layer 1 with a 5 x 5 weight matrix, adding them up, adding a 
bias, and passing through a nonlinearity. There are therefore 5 x 5 x 50 = 1250 neurons in 
Layer 2, (5 x 5 + 1) x 6 x 50 = 7800 adjustable weights (one kernel for each pair of feature 
maps in layers 1 and 2), and 1250 x 26 = 32,500 connections. Layer 3 is fully connected to 
layer 2, and has 100 neurons and 100 x (1250 + 1) = 125,100 weights. Finally, layer 4 is also 
fully connected, and has 10 neurons, and 10 x (100 + 1) = 1010 weights. Adding the above 
numbers, there are a total of 3,215 neurons, 134,066 adjustable weights, and 184,974 connections. 
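
The counts quoted above can be reproduced with a few lines of arithmetic. Two assumptions are made here that the text does not state explicitly: a convolution stride of 2 (which is what maps a 29 x 29 input to 13 x 13 and then to 5 x 5 with a 5 x 5 kernel), and that the total of 3,215 neurons includes the 29 x 29 = 841 input pixels. The connection counts follow the convention used in the text (neurons times 26 for the convolutional layers).

```python
def conv_out(size, kernel=5, stride=2):
    """Output width of a 'valid' convolution (stride 2 is an assumption, see above)."""
    return (size - kernel) // stride + 1

in_size, kernel = 29, 5
maps1, maps2 = 6, 50
s1 = conv_out(in_size)              # 13
s2 = conv_out(s1)                   # 5
per_kernel = kernel * kernel + 1    # 26: a 5 x 5 kernel plus one bias

neurons = {
    "input": in_size ** 2,
    "layer1": s1 * s1 * maps1,
    "layer2": s2 * s2 * maps2,
    "layer3": 100,
    "layer4": 10,
}
weights = {
    "layer1": per_kernel * maps1,
    "layer2": per_kernel * maps1 * maps2,
    "layer3": 100 * (neurons["layer2"] + 1),
    "layer4": 10 * (100 + 1),
}
connections = {
    "layer1": neurons["layer1"] * per_kernel,   # also the unshared weight count
    "layer2": neurons["layer2"] * per_kernel,   # counted as in the text
    "layer3": weights["layer3"],                # fully connected: one weight per connection
    "layer4": weights["layer4"],
}

print(sum(neurons.values()), sum(weights.values()), sum(connections.values()))
```

The gap between 156 shared weights and 26,364 unshared ones in layer 1 is the saving that parameter tying buys.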

This model is usually trained using stochastic gradient descent (see Section 16.5.4 for details). 
A single pass over the data set is called an epoch. When Mike O'Neill did these experiments in 
2006, he found that a single epoch took about 40 minutes (recall that there are 60,000 training 
examples in MNIST). Since it took about 30 epochs for the error rate to converge, the total 
training time was about 20 hours. Using this technique, he obtained a misclassification rate on 
the 10,000 test cases of about 1.40%. 

To further reduce the error rate, a standard trick is to expand the training set by including 
distorted versions of the original data, to encourage the network to be invariant to small changes 
that don’t affect the identity of the digit. These can be created by applying a random flow field 
to shift pixels around. See Figure 16.13 for some examples. (If we use online training, such as 
stochastic gradient descent, we can create these distortions on the fly, rather than having to 
store them.) Using this technique, Mike O’Neill obtained a misclassification rate on the 10,000 
test cases of about 0.74%, which is close to the current state of the art.⁶ 

Yann Le Cun and colleagues (LeCun et al. 1998) obtained similar performance using a slightly 
more complicated architecture shown in Figure 16.14. This model is known as LeNet5, and 
historically it came before the model in Figure 16.12. There are two main differences. First, 
LeNet5 has a subsampling layer between each convolutional layer, which either averages or 
computes the max over each small window in the previous layer, in order to reduce the size, and 
to obtain a small amount of shift invariance. The convolution and sub-sampling combination 
was inspired by Hubel and Wiesel’s model of simple and complex cells in the visual cortex 
(Hubel and Wiesel 1962), and it continues to be popular in neurally-inspired models of visual 
object recognition (Riesenhuber and Poggio 1999). A similar idea first appeared in Fukushima’s 
neocognitron (Fukushima 1975), though no globally supervised training algorithm was available 
at that time. 
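
The subsampling step described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: it uses non-overlapping windows on a plain list-of-lists feature map, and the function name `subsample` and its `size` and `mode` parameters are hypothetical (LeNet5 itself is not implemented here). It shows why pooling gives a small amount of shift invariance: moving a feature by one pixel inside a window leaves the pooled map unchanged.

```python
def subsample(feature_map, size=2, mode="max"):
    """Downsample a 2-D feature map over non-overlapping size x size windows."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            if mode == "max":
                pooled_row.append(max(window))
            else:
                pooled_row.append(sum(window) / len(window))
        pooled.append(pooled_row)
    return pooled


# A 4x4 map with one active pixel, and the same map shifted right by one pixel.
original = [[0, 0, 0, 0],
            [9, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]
shifted = [[0] + row[:-1] for row in original]

pooled_original = subsample(original)   # max over each 2x2 window
pooled_shifted = subsample(shifted)     # same result: the shift stayed inside one window
```

A shift that crosses a window boundary does change the pooled map, which is why the invariance is only approximate.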

The second difference between LeNet5 and the Simard architecture is that the final layer is 
actually an RBF network rather than a more standard sigmoidal or softmax layer. This model 
gets a test error rate of about 0.95% when trained with no distortions, and 0.8% when trained 
with distortions. Figure 16.15 shows all 82 errors made by the system. Some are genuinely 
ambiguous, but several are errors that a person would never make. A web-based demo of the 
LeNet5 can be found at http://yann.lecun.com/exdb/lenet/index.html. 

Of course, classifying isolated digits is of limited applicability: in the real world, people usually 
write strings of digits or other letters. This requires both segmentation and classification. Le Cun 
and colleagues devised a way to combine convolutional neural networks with a model similar 
to a conditional random field (described in Section 19.6) to solve this problem. The system 
was eventually deployed by the US postal service. (See (LeCun et al. 1998) for a more detailed 
account of the system, which remains one of the best performing systems for this task.) 


5. Implementation details: Mike used C++ code and a variety of speedup tricks. He was using standard 2006-era
hardware (an Intel Pentium 4 hyperthreaded processor running at 2.8GHz). See
http://www.codeproject.com/KB/library/NeuralNetRecognition.aspx for details.

6. A list of various methods, along with their misclassification rates on the MNIST test set, is available from
http://yann.lecun.com/exdb/mnist/. Error rates within 0.1-0.2% of each other are not statistically significantly different.

Figure 16.13 Several synthetic warpings of a handwritten digit. Based on Figure 5.14 of (Bishop 2006a).
Figure generated by elasticDistortionsDemo, written by Kevin Swersky.

Figure 16.14 LeNet5, a convolutional neural net for classifying handwritten digits.
Source: (LeCun et al. 1998). Used with kind permission of Yann LeCun.

Figure 16.15 These are the 82 errors made by LeNet5 on the 10,000 test cases of MNIST. Below each
image is a label of the form correct-label → estimated-label. Source: Figure 8 of (LeCun et al. 1998).
Used with kind permission of Yann LeCun. (Compare to Figure 28.4(b) which shows the results of a deep
generative model.)


Other kinds of neural networks 


Other network topologies are possible besides the ones discussed above. For example, we can 
have skip arcs that go directly from the input to the output, skipping the hidden layer; we 
can have sparse connections between the layers; etc. However, the MLP always requires that 
the weights form a directed acyclic graph. If we allow feedback connections, the model is 
known as a recurrent neural network; this defines a nonlinear dynamical system, but does 
not have a simple probabilistic interpretation. Such RNN models are currently the best approach 
for language modeling (i.e., performing word prediction in natural language) (Tomas et al. 2011), 
significantly outperforming the standard n-gram-based methods discussed in Section 17.2.2. 

If we allow symmetric connections between the hidden units, the model is known as a Hopfield
network or associative memory; its probabilistic counterpart is known as a Boltzmann
machine (see Section 27.7) and can be used for unsupervised learning.


A brief history of the field 


Neural networks have been the subject of great interest for many decades, due to the desire to 
understand the brain, and to build learning machines. It is not possible to review the entire 
history here. Instead, we just give a few “edited highlights”. 

The field is generally viewed as starting with McCulloch and Pitts (McCulloch and Pitts 1943),
who devised a simple mathematical model of the neuron in 1943, in which they approximated the
output as a weighted sum of inputs passed through a threshold function, $y = \mathbb{I}(\sum_i w_i x_i > \theta)$,
for some threshold $\theta$. This is similar to a sigmoidal activation function. Frank Rosenblatt
invented the perceptron learning algorithm in 1957, which is a way to estimate the parameters of 
a McCulloch-Pitts neuron (see Section 8.5.4 for details). A very similar model called the adaline 
(for adaptive linear element) was invented in 1960 by Widrow and Hoff. 

In 1969, Minsky and Papert (Minsky and Papert 1969) published a famous book called “Perceptrons”
in which they showed that such linear models, with no hidden layers, were very limited
in their power, since they cannot classify data that is not linearly separable. This considerably 
reduced interest in the field. 

In 1986, Rumelhart, Hinton and Williams (Rumelhart et al. 1986) discovered the backpropagation
algorithm (see Section 16.5.4), which allows one to fit models with hidden layers. (The
backpropagation algorithm was originally discovered in (Bryson and Ho 1969), and independently 
in (Werbos 1974); however, it was (Rumelhart et al. 1986) that brought the algorithm to people’s 
attention.) This spawned a decade of intense interest in these models. 

In 1987, Sejnowski and Rosenberg (Sejnowski and Rosenberg 1987) created the famous NETtalk
system, which learned a mapping from English words to phonetic symbols that could be fed
into a speech synthesizer. An audio demo of the system as it learns over time can be found at
http://www.cnl.salk.edu/ParallelNetsPronounce/nettalk.mp3. The system starts by
“babbling” and then gradually learns to pronounce English words. NETtalk learned a distributed
representation (via its hidden layer) of various sounds, and its success spawned a big debate in 
psychology between connectionism, based on neural networks, and computationalism, based 
on syntactic rules. This debate lives on to some extent in the machine learning community, 
where there are still arguments about whether learning is best performed using low-level,
“neural-like” representations, or using more structured models.

In 1989, Yann Le Cun and others (LeCun et al. 1989) created the famous LeNet system described 
in Section 16.5.1. 

In 1992, the support vector machine (see Section 14.5) was invented (Boser et al. 1992). SVMs 
provide similar prediction accuracy to neural networks while being considerably easier to train 
(since they use a convex objective function). This spawned a decade of interest in kernel methods 
in general.⁷ Note, however, that SVMs do not use adaptive basis functions, so they require a fair
amount of human expertise to design the right kernel function. 

In 2002, Geoff Hinton invented the contrastive divergence training procedure (Hinton 2002), 
which provided a way, for the first time, to learn deep networks, by training one layer at a time 
in an unsupervised fashion (see Section 27.7.2.4 for details). This in turn has spawned renewed 
interest in neural networks over the last few years (see Chapter 28). 


The backpropagation algorithm 


Unlike a GLM, the NLL of an MLP is a non-convex function of its parameters. Nevertheless, 
we can find a locally optimal ML or MAP estimate using standard gradient-based optimization 
methods. Since MLPs have lots of parameters, they are often trained on very large data sets. 


7. It became part of the folklore during the 1990s that to get published in the top machine learning conference known as
NIPS, which stands for “neural information processing systems”, it was important to ensure your paper did not contain
the word “neural network”!

Figure 16.16 Two possible activation functions. tanh maps $\mathbb{R}$ to $[-1, +1]$ and is the preferred
nonlinearity for the hidden nodes. sigm maps $\mathbb{R}$ to $[0, 1]$ and is the preferred nonlinearity for binary nodes at
the output layer. Figure generated by tanhPlot.


Consequently it is common to use first-order online methods, such as stochastic gradient descent 
(Section 8.5.2), whereas GLMs are usually fit with IRLS, which is a second-order offline method. 

We now discuss how to compute the gradient vector of the NLL by applying the chain rule of 
calculus. The resulting algorithm is known as backpropagation, for reasons that will become 
apparent. 

For notational simplicity, we shall assume a model with just one hidden layer. It is helpful
to distinguish the pre- and post-synaptic values of a neuron, that is, before and after we apply
the nonlinearity. Let $x_n$ be the n’th input, $a_n = V x_n$ be the pre-synaptic hidden layer, and
$z_n = g(a_n)$ be the post-synaptic hidden layer, where $g$ is some transfer function. We typically
use $g(a) = \text{sigm}(a)$, but we may also use $g(a) = \tanh(a)$: see Figure 16.16 for a comparison.
(When the input to sigm or tanh is a vector, we assume it is applied component-wise.)

We now convert this hidden layer to the output layer as follows. Let $b_n = W z_n$ be the
pre-synaptic output layer, and $\hat{y}_n = h(b_n)$ be the post-synaptic output layer, where $h$ is
another nonlinearity, corresponding to the canonical link for the GLM. (We reserve the notation
$y_n$, without the hat, for the output corresponding to the n’th training case.) For a regression
model, we use $h(b) = b$; for binary classification, we use $h(b) = [\text{sigm}(b_1), \ldots, \text{sigm}(b_K)]$; for
multi-class classification, we use $h(b) = \mathcal{S}(b)$.

We can write the overall model as follows: 


$$x_n \xrightarrow{V} a_n \xrightarrow{g} z_n \xrightarrow{W} b_n \xrightarrow{h} \hat{y}_n$$ (16.65)


The parameters of the model are $\theta = (V, W)$, the first and second layer weight matrices. Offset
or bias terms can be accommodated by clamping an element of $x_n$ and $z_n$ to 1.⁸


8. In the regression setting, we can easily estimate the variance of the output noise using the empirical variance of the
residual errors, $\hat{\sigma}^2 = \frac{1}{N} \|\hat{y}(\hat{\theta}) - y\|^2$, after training is complete. There will be one value of $\sigma^2$ for each output node,
if we are performing multi-target regression, as we usually assume.

In the regression case, with K outputs, the NLL is given by the squared error:

$$J(\theta) = \sum_n \sum_k (\hat{y}_{nk}(\theta) - y_{nk})^2$$ (16.66)

In the classification case, with K classes, the NLL is given by the cross entropy

$$J(\theta) = -\sum_n \sum_k y_{nk} \log \hat{y}_{nk}(\theta)$$ (16.67)

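Our task is to compute $\nabla_\theta J$. We will derive this for each n separately; the overall gradient is
obtained by summing over n, although often we just use a mini-batch (see Section 8.5.2).
Let us start by considering the output layer weights. We have

$$\nabla_{w_k} J_n = \frac{\partial J_n}{\partial b_{nk}} \nabla_{w_k} b_{nk} = \frac{\partial J_n}{\partial b_{nk}} z_n$$ (16.68)

since $b_{nk} = w_k^T z_n$. Assuming $h$ is the canonical link function for the output GLM, then
Equation 9.91 tells us that

$$\frac{\partial J_n}{\partial b_{nk}} \triangleq \delta^w_{nk} = (\hat{y}_{nk} - y_{nk})$$ (16.69)

which is the error signal. So the overall gradient is

$$\nabla_{w_k} J_n = \delta^w_{nk} z_n$$ (16.70)

which is the pre-synaptic input to the output layer, namely $z_n$, times the error signal, namely
$\delta^w_{nk}$.

For the input layer weights, we have

$$\nabla_{v_j} J_n = \frac{\partial J_n}{\partial a_{nj}} \nabla_{v_j} a_{nj} \triangleq \delta^v_{nj} x_n$$ (16.71)

where we exploited the fact that $a_{nj} = v_j^T x_n$. All that remains is to compute the first level
error signal $\delta^v_{nj}$. We have

$$\delta^v_{nj} = \frac{\partial J_n}{\partial a_{nj}} = \sum_{k=1}^K \frac{\partial J_n}{\partial b_{nk}} \frac{\partial b_{nk}}{\partial a_{nj}} = \sum_{k=1}^K \delta^w_{nk} \frac{\partial b_{nk}}{\partial a_{nj}}$$ (16.72)

Now

$$b_{nk} = \sum_j w_{kj} \, g(a_{nj})$$ (16.73)

so

$$\frac{\partial b_{nk}}{\partial a_{nj}} = w_{kj} \, g'(a_{nj})$$ (16.74)

where $g'(a) = \frac{d}{da} g(a)$. For tanh units, $g'(a) = \frac{d}{da} \tanh(a) = 1 - \tanh^2(a) = \text{sech}^2(a)$, and
for sigmoid units, $g'(a) = \frac{d}{da} \sigma(a) = \sigma(a)(1 - \sigma(a))$. Hence

$$\delta^v_{nj} = \sum_{k=1}^K \delta^w_{nk} \, w_{kj} \, g'(a_{nj})$$ (16.75)

Thus the layer 1 errors can be computed by passing the layer 2 errors back through the W matrix;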
hence the term “backpropagation”. The key property is that we can compute the gradients locally: 
each node only needs to know about its immediate neighbors. This is supposed to make the 
algorithm “neurally plausible”, although this interpretation is somewhat controversial. 

Putting it all together, we can compute all the gradients as follows: we first perform a
forwards pass to compute $a_n$, $z_n$, $b_n$ and $\hat{y}_n$. We then compute the error for the output layer,
$\delta^w_n = \hat{y}_n - y_n$, which we pass backwards through $W$ using Equation 16.75 to compute the
error for the hidden layer, $\delta^v_n$. We then compute the overall gradient as follows:

$$\nabla_\theta J(\theta) = \sum_n [\delta^v_n x_n, \; \delta^w_n z_n]$$ (16.76)
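
The forward and backward passes just described can be checked numerically. The following is a minimal sketch, not the book's own code: it assumes tanh hidden units, an identity output link (regression), no bias terms, and a per-example loss of $\frac{1}{2}\|\hat{y}_n - y_n\|^2$ (the factor of one half is an assumption made here so that the output error signal is exactly $\hat{y}_n - y_n$). All function names are hypothetical. The analytic gradients are compared against central finite differences.

```python
import math
import random


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def forward(V, W, x):
    a = matvec(V, x)                  # pre-synaptic hidden layer
    z = [math.tanh(ai) for ai in a]   # post-synaptic hidden layer
    b = matvec(W, z)                  # pre-synaptic output layer
    y_hat = b                         # identity link for regression
    return a, z, b, y_hat


def loss(V, W, x, y):
    y_hat = forward(V, W, x)[3]
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y))


def backprop(V, W, x, y):
    a, z, b, y_hat = forward(V, W, x)
    delta_w = [p - t for p, t in zip(y_hat, y)]            # output error signal
    # Hidden error: pass delta_w back through W, then scale by g'(a) = 1 - tanh(a)^2.
    delta_v = [sum(delta_w[k] * W[k][j] for k in range(len(W))) * (1.0 - z[j] ** 2)
               for j in range(len(z))]
    grad_V = [[dj * xi for xi in x] for dj in delta_v]     # outer product delta_v x^T
    grad_W = [[dk * zj for zj in z] for dk in delta_w]     # outer product delta_w z^T
    return grad_V, grad_W


def numeric_grad(V, W, x, y, eps=1e-6):
    """Central finite differences, perturbing one weight at a time."""
    grads = []
    for M in (V, W):
        G = [[0.0] * len(M[0]) for _ in M]
        for r in range(len(M)):
            for c in range(len(M[0])):
                old = M[r][c]
                M[r][c] = old + eps
                up = loss(V, W, x, y)
                M[r][c] = old - eps
                down = loss(V, W, x, y)
                M[r][c] = old
                G[r][c] = (up - down) / (2 * eps)
        grads.append(G)
    return grads


def flatten(*matrices):
    return [v for M in matrices for row in M for v in row]


rng = random.Random(0)
D, H, K = 3, 4, 2
V = [[rng.gauss(0, 0.5) for _ in range(D)] for _ in range(H)]
W = [[rng.gauss(0, 0.5) for _ in range(H)] for _ in range(K)]
x = [rng.gauss(0, 1) for _ in range(D)]
y = [rng.gauss(0, 1) for _ in range(K)]

grad_V, grad_W = backprop(V, W, x, y)
num_V, num_W = numeric_grad(V, W, x, y)
max_diff = max(abs(g - n) for g, n in
               zip(flatten(grad_V, grad_W), flatten(num_V, num_W)))
```

A gradient check of this kind is a common way to catch indexing mistakes before training; the backward pass reuses the quantities computed in the forward pass, so it costs about the same as one forward pass.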


Identifiability 


It is easy to see that the parameters of a neural network are not identifiable. For example, we 
can change the sign of the weights going into one of the hidden units, so long as we change 
the sign of all the weights going out of it; these effects cancel, since tanh is an odd function, so 
$\tanh(-a) = -\tanh(a)$. There will be $H$ such sign flip symmetries, leading to $2^H$ equivalent
settings of the parameters. Similarly, we can change the identity of the hidden units without
affecting the likelihood. There are $H!$ such permutations. The total number of equivalent
parameter settings (with the same likelihood) is therefore $H! \, 2^H$.
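
Both symmetries are easy to verify on a small example. The sketch below assumes a one-hidden-layer network with tanh units, a linear output and no biases; the weights are arbitrary illustrative numbers and the function name `mlp` is hypothetical.

```python
import math


def mlp(V, W, x):
    """Output of a one-hidden-layer tanh network with linear outputs (no biases)."""
    z = [math.tanh(sum(v * xi for v, xi in zip(row, x))) for row in V]
    return [sum(w * zj for w, zj in zip(row, z)) for row in W]


V = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]]   # H = 3 hidden units, D = 2 inputs
W = [[1.0, -2.0, 0.5]]                        # K = 1 output
x = [0.2, -0.4]

# Sign flip of hidden unit 0: negate its incoming weights and its outgoing weights.
V_flip = [[-v for v in V[0]]] + V[1:]
W_flip = [[-row[0]] + row[1:] for row in W]

# Permutation: swap hidden units 1 and 2 in both layers.
V_perm = [V[0], V[2], V[1]]
W_perm = [[row[0], row[2], row[1]] for row in W]

out = mlp(V, W, x)[0]
out_flip = mlp(V_flip, W_flip, x)[0]
out_perm = mlp(V_perm, W_perm, x)[0]

H = len(V)
n_equivalent = math.factorial(H) * 2 ** H    # H! * 2^H equivalent settings
```

With only three hidden units there are already 48 parameter settings that define the same function.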

In addition, there may be local minima due to the non-convexity of the NLL. This can 
be a more serious problem, although with enough data, these local optima are often quite 
“shallow”, and simple stochastic optimization methods can avoid them. In addition, it is common 
to perform multiple restarts, and to pick the best solution, or to average over the resulting 
predictions. (It does not make sense to average the parameters themselves, since they are not
identifiable.)


Regularization 


As usual, the MLE can overfit, especially if the number of nodes is large. A simple way to prevent 
this is called early stopping, which means stopping the training procedure when the error on 
the validation set first starts to increase. This method works because we usually initialize from 
small random weights, so the model is initially simple (since the tanh and sigm functions are 
nearly linear near the origin). As training progresses, the weights become larger, and the model 
becomes nonlinear. Eventually it will overfit. 
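
The stopping rule can be sketched as follows. The function name and the `patience` parameter are hypothetical additions for illustration: `patience=1` corresponds to stopping as soon as the validation error first fails to improve, as described above, while larger values tolerate a few noisy epochs. The validation errors are made-up numbers.

```python
def early_stopping_epoch(validation_errors, patience=1):
    """Return the index of the epoch whose weights we would keep.

    Training stops once the validation error has failed to improve for
    `patience` consecutive epochs; the best epoch seen so far is returned.
    """
    best_epoch, best_error, bad_epochs = 0, float("inf"), 0
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_epoch, best_error, bad_epochs = epoch, err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch


val_errors = [0.30, 0.21, 0.15, 0.12, 0.13, 0.11, 0.16]
stop_strict = early_stopping_epoch(val_errors, patience=1)   # stops at the first increase
stop_patient = early_stopping_epoch(val_errors, patience=2)  # rides out one bad epoch
```

In practice one would save a copy of the weights at each new best epoch so that they can be restored after stopping.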

Another way to prevent overfitting, which is more in keeping with the approaches used elsewhere
in this book, is to impose a prior on the parameters, and then use MAP estimation. It is standard
to use a $\mathcal{N}(0, \alpha^{-1} I)$ prior (equivalent to $\ell_2$ regularization), where $\alpha$ is the precision (strength)
of the prior. In the neural networks literature, this is called weight decay, since it encourages
small weights, and hence simpler models. The penalized NLL objective becomes

$$J(\theta) = -\sum_{n=1}^N \log p(y_n | x_n, \theta) + \frac{\alpha}{2} \Big[ \sum_{ij} v_{ij}^2 + \sum_{jk} w_{jk}^2 \Big]$$ (16.77)

(Note that we don't penalize the bias terms.) The gradient of the modified objective becomes

$$\nabla_\theta J(\theta) = \Big[ \sum_n \delta^v_n x_n + \alpha v, \; \sum_n \delta^w_n z_n + \alpha w \Big]$$ (16.78)

as in Section 8.3.6.
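
The name "weight decay" is easiest to see in a single gradient step. The sketch below is illustrative only: `decay_step` and `learning_rate` are hypothetical names, and the data gradient is set to zero so that only the effect of the penalty is visible. Each weight is then multiplied by $(1 - \eta\alpha)$, where $\eta$ is the step size.

```python
import math


def decay_step(weights, grad_nll, alpha, learning_rate):
    """One gradient step on the penalized NLL.

    The L2 penalty contributes alpha * w to the gradient of each weight.
    """
    return [[w - learning_rate * (g + alpha * w) for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, grad_nll)]


W = [[2.0, -4.0], [0.5, 1.0]]
zero_grad = [[0.0, 0.0], [0.0, 0.0]]   # pretend the data term has zero gradient

W_next = decay_step(W, zero_grad, alpha=0.5, learning_rate=0.1)
shrink = 1.0 - 0.1 * 0.5               # every weight decays by this factor
```

When the data gradient is nonzero, the same shrinkage is applied on top of the usual update, so weights that the data does not support drift towards zero.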

If the regularization is sufficiently strong, it does not matter if we have too many hidden units 
(apart from wasted computation). Hence it is advisable to set H to be as large as you can afford 
(say 10-100), and then to choose an appropriate regularizer. We can set the $\alpha$ parameter by
cross validation or empirical Bayes (see Section 16.5.7.5). 

As with ridge regression, it is good practice to standardize the inputs to zero mean and unit 
variance, so that the spherical Gaussian prior makes sense. 


Consistent Gaussian priors * 


One can show (MacKay 1992) that using the same regularization parameter for both the first and 
second layer weights results in the lack of a certain desirable invariance property. In particular, 
suppose we linearly scale and shift the inputs and/or outputs to a neural network regression 
model. Then we would like the model to learn to predict the same function, by suitably scaling 
its internal weights and bias terms. However, the amount of scaling needed by the first and 
second layer weights to compensate for a change in the inputs and/or outputs is not the same. 
Therefore we need to use a different regularization strength for the first and second layer. 
Fortunately, this is easy to do — we just use the following prior: 


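$$p(\theta) = \mathcal{N}(W | 0, \tfrac{1}{\alpha_w} I) \, \mathcal{N}(V | 0, \tfrac{1}{\alpha_v} I) \, \mathcal{N}(b | 0, \tfrac{1}{\alpha_b} I) \, \mathcal{N}(c | 0, \tfrac{1}{\alpha_c} I)$$ (16.79)


where b and c are the bias terms.⁹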

To get a feeling for the effect of these hyper-parameters, we can sample MLP parameters 
from this prior and plot the resulting random functions. Figure 16.17 shows some examples. 
Decreasing $\alpha_v$ allows the first layer weights to get bigger, making the sigmoid-like shape of
the functions steeper. Decreasing $\alpha_b$ allows the first layer biases to get bigger, which allows
the center of the sigmoid to shift left and right more. Decreasing $\alpha_w$ allows the second layer
weights to get bigger, making the functions more “wiggly” (greater sensitivity to change in the
input, and hence larger dynamic range). And decreasing $\alpha_c$ allows the second layer biases to
get bigger, allowing the mean level of the function to move up and down more. (In Chapter 15,
we saw an easier way to define priors over functions.)
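
The effect of the hyper-parameters can be explored by sampling. The following sketch draws a random one-dimensional function from a prior of this form; it is not mlpPriorsDemo, and the tanh hidden units, the number of hidden units, the input grid and all function names are assumptions made for illustration. Reusing the same random seed isolates the effect of changing a single precision.

```python
import math
import random


def sample_mlp(alpha_v, alpha_b, alpha_w, alpha_c, n_hidden=12, seed=0):
    """Draw one random function f(x) from a Gaussian prior over MLP parameters."""
    rng = random.Random(seed)

    def draw(alpha):
        # A precision of alpha corresponds to a standard deviation of 1/sqrt(alpha).
        return rng.gauss(0.0, 1.0 / math.sqrt(alpha))

    v = [draw(alpha_v) for _ in range(n_hidden)]   # first layer weights
    b = [draw(alpha_b) for _ in range(n_hidden)]   # first layer biases
    w = [draw(alpha_w) for _ in range(n_hidden)]   # second layer weights
    c = draw(alpha_c)                              # second layer bias

    def f(x):
        return c + sum(wj * math.tanh(vj * x + bj) for vj, bj, wj in zip(v, b, w))

    return f


def output_range(f, grid):
    values = [f(x) for x in grid]
    return max(values) - min(values)


grid = [i / 50.0 - 1.0 for i in range(101)]        # inputs in [-1, 1]
default = sample_mlp(0.01, 0.1, 1.0, 1.0)
wider = sample_mlp(0.01, 0.1, 0.1, 1.0)            # alpha_w decreased by a factor of 10

range_default = output_range(default, grid)
range_wider = output_range(wider, grid)
```

Because the same seed is used, decreasing $\alpha_w$ by a factor of 10 scales the second layer weights, and hence the dynamic range of the sampled function, by $\sqrt{10}$.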


Weight pruning 


Since there are many weights in a neural network, it is often helpful to encourage sparsity. 
Various ad-hoc methods for doing this, with names such as “optimal brain damage”, were 
devised in the 1990s; see e.g., (Bishop 1995) for details. 


9. Since we are regularizing the output bias terms, it is helpful, in the case of regression, to normalize the target
responses in the training set to zero mean, to be consistent with the fact that the prior on the output bias has zero
mean.

Figure 16.17 The effects of changing the hyper-parameters on an MLP. (a) Default parameter values
$\alpha_v = 0.01$, $\alpha_b = 0.1$, $\alpha_w = 1$, $\alpha_c = 1$. (b) Decreasing $\alpha_v$ by factor of 10. (c) Decreasing $\alpha_b$ by
factor of 10. (d) Decreasing $\alpha_w$ by factor of 10. (e) Decreasing $\alpha_c$ by factor of 10. Figure generated by
mlpPriorsDemo.

Figure 16.18 (a) A deep but sparse neural network. The connections are pruned using $\ell_1$ regularization.
At each level, nodes numbered 0 are clamped to 1, so their outgoing weights correspond to the offset/bias
terms. (b) Predictions made by the model on the training set. Figure generated by sparseNnetDemo,
written by Mark Schmidt.


However, we can also use the more principled sparsity-promoting techniques we discussed in 
Chapter 13. One approach is to use an $\ell_1$ regularizer. See Figure 16.18 for an example. Another
approach is to use ARD; this is discussed in more detail in Section 16.5.7.5. 


Soft weight sharing* 


Another way to regularize the parameters is to encourage similar weights to share statistical 
strength. But how do we know which parameters to group together? We can learn this, by using 
a mixture model. That is, we model $p(\theta)$ as a mixture of (diagonal) Gaussians. Parameters that
are assigned to the same cluster will share the same mean and variance and thus will have 
similar values (assuming the variance for that cluster is low). This is called soft weight sharing 
(Nowlan and Hinton 1992). In practice, this technique is not widely used. See e.g., (Bishop 2006a, 
p271) if you want to know the details. 


Semi-supervised embedding * 


An interesting way to regularize “deep” feedforward neural networks is to encourage the hidden 
layers to assign similar objects to similar representations. This is useful because it is often easy 
to obtain “side” information consisting of sets of pairs of similar and dissimilar objects. For 
example, in a video classification task, neighboring frames can be deemed similar, but frames 
that are distant in time can be deemed dissimilar (Mobahi et al. 2009). Note that this can be
done without collecting any labels. 

Let $S_{ij} = 1$ if examples $i$ and $j$ are similar, and $S_{ij} = 0$ otherwise. Let $f(x_i)$ be some
embedding of item $x_i$, e.g., $f(x_i) = z(x_i, \theta)$, where $z$ is the hidden layer of a neural network.
Now define a loss function $L(f(x_i), f(x_j), S_{ij})$ that depends on the embedding of two objects,
and the observed similarity measure. For example, we might want to force similar objects to
have similar embeddings, and to force the embeddings of dissimilar objects to be a minimal
distance apart:

$$L(f_i, f_j, S_{ij}) = \begin{cases} \|f_i - f_j\|^2 & \text{if } S_{ij} = 1 \\ \max(0, m - \|f_i - f_j\|_2) & \text{if } S_{ij} = 0 \end{cases}$$ (16.80)


where m is some minimal margin. We can now define an augmented loss function for training 
the neural network: 


$$\sum_{i \in \mathcal{L}} \text{NLL}(f(x_i), y_i) + \lambda \sum_{i,j \in \mathcal{U}} L(f(x_i), f(x_j), S_{ij})$$ (16.81)


where $\mathcal{L}$ is the labeled training set, $\mathcal{U}$ is the unlabeled training set, and $\lambda > 0$ is some tradeoff
parameter. This is called semi-supervised embedding (Weston et al. 2008).
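
The pairwise loss and the augmented objective can be written down directly. This sketch assumes that the similar branch uses the squared Euclidean distance and that the dissimilar branch applies the hinge to the (unsquared) Euclidean distance; the function names, the default margin and the numbers in the example are hypothetical.

```python
import math


def pair_loss(f_i, f_j, similar, margin=1.0):
    """Embedding loss for one pair: pull similar pairs together and push
    dissimilar pairs until they are at least `margin` apart."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))
    if similar:
        return dist ** 2
    return max(0.0, margin - dist)


def augmented_loss(nll_terms, pair_terms, lam):
    """Supervised NLL on labeled data plus lam times the pairwise embedding loss."""
    return sum(nll_terms) + lam * sum(pair_terms)


loss_similar = pair_loss([0.0, 0.0], [3.0, 4.0], similar=True)            # distance 5
loss_far = pair_loss([0.0, 0.0], [3.0, 4.0], similar=False)               # beyond the margin
loss_close = pair_loss([0.0, 0.0], [0.0, 0.5], similar=False, margin=2.0)  # inside the margin
total = augmented_loss([0.2, 0.3], [loss_similar, loss_close], lam=0.1)
```

Dissimilar pairs that are already further apart than the margin contribute nothing, so the gradient focuses on pairs that violate the constraint.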

Such an objective can be easily optimized by stochastic gradient descent. At each iteration,
pick a random labeled training example, $(x_n, y_n)$, and take a gradient step to optimize
$\text{NLL}(f(x_n), y_n)$. Then pick a random pair of similar unlabeled examples $x_i$, $x_j$ (these can
sometimes be generated on the fly rather than stored in advance), and make a gradient step to
optimize $\lambda L(f(x_i), f(x_j), 1)$. Finally, pick a random unlabeled example $x_k$, which with high
probability is dissimilar to $x_i$, and make a gradient step to optimize $\lambda L(f(x_i), f(x_k), 0)$.

Note that this technique is effective because it can leverage massive amounts of data. In 
a related approach, (Collobert and Weston 2008) trained a neural network to distinguish valid 
English sentences from invalid ones. This was done by taking all 631 million words from English 
Wikipedia (en.wikipedia.org), and then creating windows of length 11 containing neighboring 
words. This constitutes the positive examples. To create negative examples, the middle word of 
each window was replaced by a random English word (this is likely to be an “invalid” sentence 
— either grammatically and/or semantically — with high probability). This neural network was 
then trained over the course of 1 week, and its latent representation was then used as the input 
to a supervised semantic role labeling task, for which very little labeled training data is available. 
(See also (Ando and Zhang 2005) for related work.) 


Bayesian inference * 


Although MAP estimation is a successful way to reduce overfitting, there are still some good
reasons to want to adopt a fully Bayesian approach to “fitting” neural networks: 


• Integrating out the parameters instead of optimizing them is a much stronger form of
regularization than MAP estimation.

• We can use Bayesian model selection to determine things like the hyper-parameter settings
and the number of hidden units. This is likely to be much faster than cross validation,
especially if we have many hyper-parameters (e.g., as in ARD).

• Modelling uncertainty in the parameters will induce uncertainty in our predictive
distributions, which is important for certain problems such as active learning and risk-averse decision
making.

• We can use online inference methods, such as the extended Kalman filter, to do online
learning (Haykin 2001).


One can adopt a variety of approximate Bayesian inference techniques in this context. In this 
section, we discuss the Laplace approximation, first suggested in (MacKay 1992, 1995b). One can 
also use hybrid Monte Carlo (Neal 1996), or variational Bayes (Hinton and Camp 1993; Barber 
and Bishop 1998). 


Parameter posterior for regression 


We start by considering regression, following the presentation of (Bishop 2006a, sec 5.7), which 
summarizes the work of (MacKay 1992, 1995b). We will use a prior of the form $p(w) =
\mathcal{N}(w | 0, (1/\alpha) I)$, where $w$ represents all the weights combined. We will denote the precision
of the noise by $\beta = 1/\sigma^2$.

The posterior can be approximated as follows: 


$$p(w | D, \alpha, \beta) \propto \exp(-E(w))$$ (16.82)
$$E(w) \triangleq \beta E_D(w) + \alpha E_W(w)$$ (16.83)
$$E_D(w) \triangleq \frac{1}{2} \sum_{n=1}^N (y_n - f(x_n, w))^2$$ (16.84)
$$E_W(w) \triangleq \frac{1}{2} w^T w$$ (16.85)


where $E_D$ is the data error, $E_W$ is the prior error, and $E$ is the overall error (the negative of
the sum of the log prior and the log likelihood, up to an additive constant). Now let us make a
second-order Taylor series approximation of $E(w)$ around its minimum (the MAP estimate)

$$E(w) \approx E(w_{MP}) + \frac{1}{2} (w - w_{MP})^T A (w - w_{MP})$$ (16.86)

where $A$ is the Hessian of $E$:

$$A = \nabla\nabla E(w_{MP}) = \beta H + \alpha I$$ (16.87)

where $H = \nabla\nabla E_D(w_{MP})$ is the Hessian of the data error. This can be computed exactly
in $O(d^2)$ time, where $d$ is the number of parameters, using a variant of backpropagation (see
(Bishop 2006a, sec 5.4) for details). Alternatively, if we use a quasi-Newton method to find
the mode, we can use its internally computed (low-rank) approximation to $H$. (Note that
diagonal approximations of $H$ are usually very inaccurate.) In either case, using this quadratic
approximation, the posterior becomes Gaussian:


$$p(w | \alpha, \beta, D) \approx \mathcal{N}(w | w_{MP}, A^{-1})$$ (16.88)


Parameter posterior for classification


The classification case is the same as the regression case, except $\beta = 1$ and $E_D$ is a
cross-entropy error of the form

$$E_D(w) \triangleq -\sum_{n=1}^N \big[ y_n \ln f(x_n, w) + (1 - y_n) \ln(1 - f(x_n, w)) \big]$$ (16.89)


Predictive posterior for regression


The posterior predictive density is given by

$$p(y | x, D, \alpha, \beta) = \int \mathcal{N}(y | f(x, w), 1/\beta) \, \mathcal{N}(w | w_{MP}, A^{-1}) \, dw$$ (16.91)


This is not analytically tractable because of the nonlinearity of $f(x, w)$. Let us therefore
construct a first-order Taylor series approximation around the mode:

$$f(x, w) \approx f(x, w_{MP}) + g^T (w - w_{MP})$$ (16.92)

where

$$g = \nabla_w f(x, w) |_{w = w_{MP}}$$ (16.93)

We now have a linear-Gaussian model with a Gaussian prior on the weights. From Equation 4.126
we have

$$p(y | x, D, \alpha, \beta) \approx \mathcal{N}(y | f(x, w_{MP}), \sigma^2(x))$$ (16.94)

where the predictive variance depends on the input x as follows:

$$\sigma^2(x) = \beta^{-1} + g^T A^{-1} g$$ (16.95)


The error bars will be larger in regions of input space where we have little training data. See 
Figure 16.19 for an example. 
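
The input-dependent variance can be illustrated on a toy case. The sketch below is not an MLP: to keep $g$ and $H$ available in closed form it assumes a model that is linear in its parameters, $f(x, w) = w_0 + w_1 x$, so that $g = [1, x]$ and the Hessian of the data error is $\sum_n g_n g_n^T$. For a neural network both quantities would instead come from backpropagation. The function names, training inputs and hyper-parameter values are hypothetical.

```python
import math


def solve(A, b):
    """Solve A u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    u = [0.0] * n
    for r in reversed(range(n)):
        u[r] = (M[r][n] - sum(M[r][c] * u[c] for c in range(r + 1, n))) / M[r][r]
    return u


def predictive_variance(g, A, beta):
    """sigma^2(x) = 1/beta + g^T A^{-1} g, computed without forming the inverse."""
    u = solve(A, g)
    return 1.0 / beta + sum(gi * ui for gi, ui in zip(g, u))


def grad(x):
    return [1.0, x]   # gradient of f(x, w) = w0 + w1 * x with respect to w


alpha, beta = 1.0, 25.0
x_train = [-0.5, -0.25, 0.0, 0.25, 0.5]

# Hessian of the data error for this linear model, then A = beta * H + alpha * I.
H = [[sum(grad(x)[i] * grad(x)[j] for x in x_train) for j in range(2)]
     for i in range(2)]
A = [[beta * H[i][j] + (alpha if i == j else 0.0) for j in range(2)]
     for i in range(2)]

var_near = predictive_variance(grad(0.0), A, beta)   # inside the training inputs
var_far = predictive_variance(grad(3.0), A, beta)    # far from the training inputs
```

The variance at $x = 3$ is much larger than at $x = 0$, matching the statement that the error bars grow in regions with little training data.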


Predictive posterior for classification 


In this section, we discuss how to approximate $p(y | x, D)$ in the case of binary classification.
The situation is similar to the case of logistic regression, discussed in Section 8.4.4, except in
addition the posterior predictive mean is a non-linear function of $w$. Specifically, we have
$\mu = E[y | x, w] = \text{sigm}(a(x, w))$, where $a(x, w)$ is the pre-synaptic output of the final layer.
Let us make a linear approximation to this:

$$a(x, w) \approx a_{MP}(x) + g^T (w - w_{MP})$$ (16.96)

where $a_{MP}(x) = a(x, w_{MP})$ and $g = \nabla_w a(x, w_{MP})$ can be found by a modified version of
backpropagation. Clearly

$$p(a | x, D) \approx \mathcal{N}(a(x, w_{MP}), \; g(x)^T A^{-1} g(x))$$ (16.97)

Figure 16.19 The posterior predictive density for an MLP with 3 hidden nodes, trained on 16 data points.
The dashed green line is the true function. (a) Result of using a Laplace approximation, after performing
empirical Bayes to optimize the hyperparameters. The solid red line is the posterior mean prediction,
and the dotted blue lines are 1 standard deviation above and below the mean. Figure generated by
mlpRegEvidenceDemo. (b) Result of using hybrid Monte Carlo, using the same trained hyperparameters
as in (a). The solid red line is the posterior mean prediction, and the dotted blue lines are samples from
the posterior predictive. Figure generated by mlpRegHmcDemo, written by Ian Nabney.


Hence the posterior predictive for the output is

$$p(y = 1 | x, D) = \int \text{sigm}(a) \, p(a | x, D) \, da \approx \text{sigm}(\kappa(\sigma_a^2) \, a_{MP}(x))$$ (16.98)

where $\sigma_a^2 = g(x)^T A^{-1} g(x)$ is the variance in Equation 16.97, and $\kappa$ is defined by
Equation 8.70, which we repeat here for convenience:

$$\kappa(\sigma^2) \triangleq (1 + \pi \sigma^2 / 8)^{-1/2}$$ (16.99)

Of course, a simpler (and potentially more accurate) alternative to this is to draw a few samples
from the Gaussian posterior and to approximate the posterior predictive using Monte Carlo.

In either case, the effect of taking uncertainty of the parameters into account, as in
Section 8.4.4, is to “moderate” the confidence of the output; the decision boundary itself is
unaffected, however.
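
Both approaches are easy to compare when the Gaussian over the activation is given. The sketch below assumes we already have the mean and variance of $a$ (the values used are made up) and compares the plug-in prediction, the moderated prediction and a Monte Carlo estimate. All function names are hypothetical.

```python
import math
import random


def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))


def kappa(var):
    return 1.0 / math.sqrt(1.0 + math.pi * var / 8.0)


def moderated_output(a_mean, a_var):
    """Approximate E[sigm(a)] for a ~ N(a_mean, a_var) by rescaling the mean activation."""
    return sigm(kappa(a_var) * a_mean)


def monte_carlo_output(a_mean, a_var, n_samples=20000, seed=0):
    """Estimate E[sigm(a)] by sampling activations from the Gaussian."""
    rng = random.Random(seed)
    sd = math.sqrt(a_var)
    return sum(sigm(rng.gauss(a_mean, sd)) for _ in range(n_samples)) / n_samples


a_mean, a_var = 2.0, 4.0
plug_in = sigm(a_mean)                       # ignores parameter uncertainty
moderated = moderated_output(a_mean, a_var)  # pulled towards 0.5
sampled = monte_carlo_output(a_mean, a_var)
on_boundary = moderated_output(0.0, a_var)   # stays at 0.5 whatever the variance
```

The moderated probability is less extreme than the plug-in value, while an activation of zero still maps to 0.5, which is why the decision boundary does not move.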


ARD for neural networks 


Once we have made the Laplace approximation to the posterior, we can optimize the marginal
likelihood wrt the hyper-parameters $\alpha$ using the same fixed-point equations as in Section 13.7.4.2.
Typically we use one hyper-parameter for the weight vector leaving each node, to achieve an
effect similar to group lasso (Section 13.5.1). That is, the prior has the form

$$p(\theta) = \prod_{i=1}^{D} \mathcal{N}(v_{:,i} | 0, \tfrac{1}{\alpha_{v,i}} I) \prod_{j=1}^{H} \mathcal{N}(w_{:,j} | 0, \tfrac{1}{\alpha_{w,j}} I)$$ (16.100)

If we find $\alpha_{v,i} = \infty$, then input feature $i$ is irrelevant, and its weight vector $v_{:,i}$ is pruned out.
Similarly, if we find $\alpha_{w,j} = \infty$, then hidden feature $j$ is irrelevant. This is known as automatic
relevancy determination or ARD, which was discussed in detail in Section 13.7. Applying this to
neural networks gives us an efficient means of variable selection in non-linear models.

The software package NETLAB contains a simple example of ARD applied to a neural network, 
called demard. This demo creates some data according to a nonlinear regression function 
$f(x_1, x_2, x_3) = \sin(2\pi x_1) + \epsilon$, where $x_2$ is a noisy copy of $x_1$. We see that $x_2$ and $x_3$ are
irrelevant for predicting the target. However, $x_2$ is correlated with $x_1$, which is relevant. Using
ARD, the final hyper-parameters are as follows:


$$\alpha = [0.2, 21.4, 249001.8]$$ (16.101)


This clearly indicates that feature 3 is irrelevant, feature 2 is only weakly relevant, and feature 1 
is very relevant. 


Ensemble learning 


Ensemble learning refers to learning a weighted combination of base models of the form 


$$f(y | x, \pi) = \sum_{m \in \mathcal{M}} w_m f_m(y | x)$$ (16.102)


where the $w_m$ are tunable parameters. Ensemble learning is sometimes called a committee
method, since each base model $f_m$ gets a weighted “vote”.

Clearly ensemble learning is closely related to learning adaptive-basis function models. In
fact, one can argue that a neural net is an ensemble method, where $f_m$ represents the m’th
hidden unit, and $w_m$ are the output layer weights. Also, we can think of boosting as a kind of
ensemble learning, where the weights on the base models are determined sequentially. Below 
we describe some other forms of ensemble learning. 


Stacking 


An obvious way to estimate the weights in Equation 16.102 is to use

$$\hat{w} = \arg\min_w \sum_{i=1}^N L\Big(y_i, \sum_{m=1}^M w_m f_m(x_i)\Big)$$ (16.103)

However, this will result in overfitting, with $w_m$ being large for the most complex model. A
simple solution to this is to use cross-validation. In particular, we can use the LOOCV estimate

$$\hat{w} = \arg\min_w \sum_{i=1}^N L\Big(y_i, \sum_{m=1}^M w_m \hat{f}_m^{-i}(x_i)\Big)$$ (16.104)


where $\hat{f}_m^{-i}(x)$ is the predictor obtained by training on data excluding $(x_i, y_i)$. This is known
as stacking, which stands for “stacked generalization” (Wolpert 1992). This technique is more
robust to the case where the “true” model is not in the model class than standard BMA (Clarke
2003). This approach was used by the Netflix team known as “The Ensemble”, which tied the
submission of the winning team (BellKor’s Pragmatic Chaos) in terms of accuracy (Sill et al.
2009). Stacking has also been used for problems such as image segmentation and labeling.

Class   C1   C2   C3   C4   C5   C6   ...   C15
0       1    1    0    0    0    0    ...   1
1       0    0    1    1    1    1    ...   0
...
9       0    1    1    1    0    0    ...   0


Table 16.2 Part of a 15-bit error-correcting output code for a 10-class problem. Each column defines a
two-class problem. Based on Table 16.1 of (Hastie et al. 2009).
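
The stacking estimate of Equation 16.104 can be sketched for a very small case. The example assumes squared error loss and just two base models (a constant predictor and a straight-line fit); the data, the base models and every function name are hypothetical and chosen only to keep the code short. The weights are found by solving the 2x2 normal equations on the leave-one-out predictions.

```python
def fit_mean(xs, ys):
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return lambda x: y_bar + slope * (x - x_bar)


def loo_predictions(fit, xs, ys):
    """Predict each point using a model trained on all the other points."""
    return [fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])(xs[i])
            for i in range(len(xs))]


def stacking_weights(p1, p2, ys):
    """Least squares weights for two base models, via the 2x2 normal equations."""
    a11 = sum(u * u for u in p1)
    a12 = sum(u * v for u, v in zip(p1, p2))
    a22 = sum(v * v for v in p2)
    b1 = sum(u * y for u, y in zip(p1, ys))
    b2 = sum(v * y for v, y in zip(p2, ys))
    det = a11 * a22 - a12 * a12
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det


def sse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys))


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.2]

p_mean = loo_predictions(fit_mean, xs, ys)
p_line = loo_predictions(fit_line, xs, ys)
w_mean, w_line = stacking_weights(p_mean, p_line, ys)
stacked = [w_mean * u + w_line * v for u, v in zip(p_mean, p_line)]
```

Because the weights are fit to held-out predictions, a base model that merely memorizes the training set gains no advantage; on the leave-one-out predictions the stacked combination can do no worse than either base model alone.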


Error-correcting output codes 


An interesting form of ensemble learning is known as error-correcting output codes or ECOC 
(Dietterich and Bakiri 1995), which can be used in the context of multi-class classification. The 
idea is that we are trying to decode a symbol (namely the class label) which has C possible 
states. We could use a bit vector of length B = [log, C] to encode the class label, and train 
B separate binary classifiers to predict each bit. However, by using more bits, and by designing 
the codewords to have maximal Hamming distance from each other, we get a method that is 
more resistant to individual bit-flipping errors (misclassification). For example, in Table 16.2, we 
use B = 15 bits to encode a C = 10 class problem. The minimum Hamming distance between 
any pair of rows is 7. The decoding rule is 


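$$\hat{c}(\mathbf{x}) = \operatorname*{argmin}_{c} \sum_{b=1}^{B} |C_{cb} - \hat{p}_b(\mathbf{x})| \tag{16.105}$$


where $C_{cb}$ is the $b$’th bit of the codeword for class $c$, and $\hat{p}_b(\mathbf{x})$ is the output of the $b$’th binary classifier. (James and Hastie 1998) showed that a 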
random code worked just as well as the optimal code: both methods work by averaging the 
results of multiple classifiers, thereby reducing variance. 
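
The decoding rule of Equation 16.105 can be sketched as follows. The 4-class, 7-bit codebook below is hypothetical (it is not the code of Table 16.2), and the classifier outputs are made-up numbers standing in for the predictions of the binary classifiers.

```python
def ecoc_decode(codebook, bit_probs):
    """Return the class whose codeword is closest (in L1 distance) to the predicted bits."""
    def dist(c):
        return sum(abs(cb - pb) for cb, pb in zip(codebook[c], bit_probs))
    return min(codebook, key=dist)

# Hypothetical code: every pair of codewords differs in 4 positions,
# so any single bit error can be corrected.
codebook = {
    0: [0, 0, 0, 0, 0, 0, 0],
    1: [1, 1, 1, 1, 0, 0, 0],
    2: [1, 1, 0, 0, 1, 1, 0],
    3: [0, 0, 1, 1, 1, 1, 0],
}

# True class is 2, but the sixth binary classifier makes a mistake.
hard_outputs = [1, 1, 0, 0, 1, 0, 0]
# Soft outputs (probabilities) can be decoded in the same way.
soft_outputs = [0.9, 0.8, 0.2, 0.1, 0.7, 0.4, 0.1]

decoded_hard = ecoc_decode(codebook, hard_outputs)
decoded_soft = ecoc_decode(codebook, soft_outputs)
```

With a minimum Hamming distance of 4, the single flipped bit still leaves the correct codeword strictly closer than any other.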


Ensemble learning is not equivalent to Bayes model averaging 


In Section 5.3, we discussed Bayesian model selection. An alternative to picking the best model, 
and then using this to make predictions, is to make a weighted average of the predictions made 
by each model, i.e., we compute 


$$p(y|\mathbf{x}, \mathcal{D}) = \sum_{m \in \mathcal{M}} p(y|\mathbf{x}, m, \mathcal{D})\, p(m|\mathcal{D}) \tag{16.106}$$


This is called Bayes model averaging (BMA), and can sometimes give better performance than 
using any single model (Hoeting et al. 1999). Of course, averaging over all models is typically 
computationally infeasible (analytical integration is obviously not possible in a discrete space, 
although one can sometimes use dynamic programming to perform the computation exactly, 
e.g., (Meila and Jaakkola 2006)). A simple approximation is to sample a few models from the 
posterior. An even simpler approximation (and the one most widely used in practice) is to just 
use the MAP model. 

It is important to note that BMA is not equivalent to ensemble learning (Minka 2000c). This 
latter technique corresponds to enlarging the model space, by defining a single new model 


MODEL      1ST    2ND    3RD    4TH    5TH    6TH    7TH    8TH    9TH    10TH
BST-DT     0.580  0.228  0.160  0.023  0.009  0.000  0.000  0.000  0.000  0.000
RF         0.390  0.525  0.084  0.001  0.000  0.000  0.000  0.000  0.000  0.000
BAG-DT     0.030  0.232  0.571  0.150  0.017  0.000  0.000  0.000  0.000  0.000
SVM        0.000  0.008  0.148  0.574  0.240  0.029  0.001  0.000  0.000  0.000
ANN        0.000  0.007  0.035  0.230  0.606  0.122  0.000  0.000  0.000  0.000
KNN        0.000  0.000  0.000  0.009  0.114  0.592  0.245  0.038  0.002  0.000
BST-STMP   0.000  0.000  0.002  0.013  0.014  0.257  0.710  0.004  0.000  0.000
DT         0.000  0.000  0.000  0.000  0.000  0.000  0.004  0.616  0.291  0.089
LOGREG     0.000  0.000  0.000  0.000  0.000  0.000  0.040  0.312  0.423  0.225
NB         0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.030  0.284  0.686


Table 16.3 Fraction of time each method achieved a specified rank, when sorting by mean performance 
across 11 datasets and 8 metrics. Based on Table 4 of (Caruana and Niculescu-Mizil 2006). Used with kind 
permission of Alexandru Niculescu-Mizil. 


which is a convex combination of base models, as follows: 


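$$p(y|\mathbf{x}, \boldsymbol{\pi}) = \sum_{m \in \mathcal{M}} \pi_m\, p(y|\mathbf{x}, m) \tag{16.107}$$


In principle, we can now perform Bayesian inference to compute $p(\boldsymbol{\pi}|\mathcal{D})$; we then make 
predictions using $p(y|\mathbf{x}, \mathcal{D}) = \int p(y|\mathbf{x}, \boldsymbol{\pi})\, p(\boldsymbol{\pi}|\mathcal{D})\, d\boldsymbol{\pi}$. However, it is much more common to use 
point estimation methods for $\boldsymbol{\pi}$, as we saw above. 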


Experimental comparison 


We have described many different methods for classification and regression. Which one should 
you use? That depends on which inductive bias you think is most appropriate for your domain. 
Usually this is hard to assess, so it is common to just try several different methods, and 
see how they perform empirically. Below we summarize two such comparisons that were 
carefully conducted (although the data sets that were used are relatively small). See the website 
mlcomp.org for a distributed way to perform large scale comparisons of this kind. Of course, 
we must always remember the no free lunch theorem (Section 1.4.9), which tells us that there is 
no universally best learning method. 


Low-dimensional features 


In 2006, Rich Caruana and Alex Niculescu-Mizil (Caruana and Niculescu-Mizil 2006) conducted 
a very extensive experimental comparison of 10 different binary classification methods, on 11 
different data sets. The 11 data sets all had 5000 training cases, and had test sets containing 
~10,000 examples on average. The number of features ranged from 9 to 200, so this is much 
lower dimensional than the NIPS 2003 feature selection challenge. 5-fold cross validation was 
used to assess average test error. (This is separate from any internal CV a method may need to 
use for model selection.) 

The methods they compared are as follows (listed in roughly decreasing order of performance, 
as assessed by Table 16.3): 


• BST-DT: boosted decision trees 
• RF: random forest 
• BAG-DT: bagged decision trees 
• SVM: support vector machine 
• ANN: artificial neural network 
• KNN: K-nearest neighbors 
• BST-STMP: boosted stumps 
• DT: decision tree 
• LOGREG: logistic regression 
• NB: naive Bayes 


They used 8 different performance measures, which can be divided into three groups. Threshold 
metrics just require a point estimate as output. These include accuracy, F-score 
(Section 5.7.2.3), etc. Ordering/ranking metrics measure how well positive cases are ordered before 
the negative cases. These include area under the ROC curve (Section 5.7.2.1), average precision, 
and the precision/recall break even point. Finally, the probability metrics included cross-entropy 
(log-loss) and squared error, $(y - \hat{p})^2$. Methods such as SVMs that do not produce calibrated 
probabilities were post-processed using Platt’s logistic regression trick (Section 14.5.2.3), or using 
isotonic regression. Performance measures were standardized to a 0:1 scale so they could be 
compared. 

Obviously the results vary by dataset and by metric. Therefore just averaging the performance 
does not necessarily give reliable conclusions. However, one can perform a bootstrap analysis, 
which shows how robust the conclusions are to such changes. The results are shown in 
Table 16.3. We see that most of the time, boosted decision trees are the best method, followed 
by random forests, bagged decision trees, SVMs and neural networks. However, the following 
methods all did relatively poorly: KNN, stumps, single decision trees, logistic regression and 
naive Bayes. 

These results are generally consistent with the conventional wisdom of practitioners in the field. 
Of course, the conclusions may change if the features are high dimensional and/or there 
are lots of irrelevant features (as in Section 16.7.2), or if there is lots of noise, etc. 


High-dimensional features 


In 2003, the NIPS conference ran a competition where the goal was to solve binary classification 
problems with large numbers of (mostly irrelevant) features, given small training sets. (This 
was called a “feature selection” challenge, but performance was measured in terms of predictive 
accuracy, not in terms of the ability to select features.) The five datasets that were used are 
summarized in Table 16.4. The term probe refers to artificial variables that were added to the 
problem to make it harder. These have no predictive power, but are correlated with the original 
features. 

Results of the competition are discussed in (Guyon et al. 2006). The overall winner was an 
approach based on Bayesian neural networks (Neal and Zhang 2006). In a follow-up study 


Dataset    Domain               Type     D         % probes   N_train   N_val   N_test
Arcene     Mass spectrometry    Dense    10,000    30         100       100     700
Dexter     Text classification  Sparse   20,000    50         300       300     2000
Dorothea   Drug discovery       Sparse   100,000   50         800       350     800
Gisette    Digit recognition    Dense    5000      30         6000      1000    6500
Madelon    Artificial           Dense    500       96         2000      600     1800


Table 16.4 Summary of the data used in the NIPS 2003 “feature selection” challenge. For the Dorothea 
dataset, the features are binary. For the others, the features are real-valued. 


                  Screened features          ARD
Method            Avg rank   Avg time        Avg rank   Avg time
HMC MLP           1.5        384 (138)       1.6        600 (186)
Boosted MLP       3.8        9.4 (8.6)       2.2        35.6 (33.5)
Bagged MLP        3.6        3.5 (1.1)       4.0        6.4 (4.4)
Boosted trees     3.4        3.03 (2.5)      4.0        34.1 (32.4)
Random forests    2.7        1.9 (1.7)       3.2        11.2 (9.3)

Table 16.5 Performance of different methods on the NIPS 2003 “feature selection” challenge. (HMC 
stands for hybrid Monte Carlo; see Section 24.5.4.) We report the average rank (lower is better) across the 
5 datasets. We also report the average training time in minutes (standard error in brackets). The MCMC 
and bagged MLPs use two hidden layers of 20 and 8 units. The boosted MLPs use one hidden layer with 2 
or 4 hidden units. The boosted trees used depths between 2 and 9, and shrinkage between 0.001 and 0.1. 
Each tree was trained on 80% of the data chosen at random at each step (so-called stochastic gradient 
boosting). From Table 11.3 of (Hastie et al. 2009). 


(Johnson 2009), Bayesian neural nets (MLPs with 2 hidden layers) were compared to several other 
methods based on bagging and boosting. Note that all of these methods are quite similar: in 
each case, the prediction has the form 


$$\hat{f}(\mathbf{x}_*) = \sum_{m=1}^{M} w_m\, \mathbb{E}[y|\mathbf{x}_*, \boldsymbol{\theta}_m] \tag{16.108}$$


The Bayesian MLP was fit by MCMC (hybrid Monte Carlo), so we set $w_m = 1/M$ and set $\boldsymbol{\theta}_m$ 
to a draw from the posterior. In bagging, we set $w_m = 1/M$ and $\boldsymbol{\theta}_m$ is estimated by fitting 
the model to a bootstrap sample from the data. In boosting, we set $w_m = 1$ and the $\boldsymbol{\theta}_m$ are 
estimated sequentially. 

To improve computational and statistical performance, some feature selection was performed. 
Two methods were considered: simple uni-variate screening using T-tests, and a method based 
on MLP+ARD. Results of this follow-up study are shown in Table 16.5. We see that Bayesian MLPs 
are again the winner. In second place are either random forests or boosted MLPs, depending 
on the preprocessing. However, it is not clear how statistically significant these differences are, 
since the test sets are relatively small. 

In terms of training time, we see that MCMC is much slower than the other methods. It would 
be interesting to see how well deterministic Bayesian inference (e.g., Laplace approximation) 
would perform. (Obviously it will be much faster, but the question is: how much would one lose 
in statistical performance?) 


Figure 16.20 Partial dependence plots for the 10 predictors in Friedman’s synthetic 5-dimensional 
regression problem. Source: Figure 4 of (Chipman et al. 2010). Used with kind permission of Hugh 
Chipman. 


Interpreting black-box models 


Linear models are popular in part because they are easy to interpret. However, they often are 
poor predictors, which makes them a poor proxy for “nature’s mechanism”. Thus any conclusions 
about the importance of particular variables should only be based on models that have good 
predictive accuracy (Breiman 2001b). (Interestingly, many standard statistical tests of “goodness 
of fit” do not test the predictive accuracy of a model.) 

In this chapter, we studied black-box models, which do have good predictive accuracy. 
Unfortunately, they are hard to interpret directly. Fortunately, there are various heuristics we can 
use to “probe” such models, in order to assess which input variables are the most important. 

As a simple example, consider the following non-linear function, first proposed in (Friedman 
1991) to illustrate the power of MARS: 


$$f(\mathbf{x}) = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \epsilon \tag{16.109}$$


where $\epsilon \sim \mathcal{N}(0, 1)$. We see that the output is a complex function of the inputs. By augmenting 
the x vector with additional irrelevant random variables, all drawn uniform on [0,1], we can 
create a challenging feature selection problem. In the experiments below, we add 5 extra dummy 
variables. 


Figure 16.21 Average usage of each variable in a BART model fit to data where only the first 5 features are 
relevant. The different coloured lines correspond to different numbers of trees in the ensemble. Source: 
Figure 5 of (Chipman et al. 2010). Used with kind permission of Hugh Chipman. 


One useful way to measure the effect of a set $s$ of variables on the output is to compute a 
partial dependence plot (Friedman 2001). This is a plot of $f(\mathbf{x}_s)$ vs $\mathbf{x}_s$, where $f(\mathbf{x}_s)$ is defined 
as the response to $\mathbf{x}_s$ with the other predictors averaged out: 


$$f(\mathbf{x}_s) = \frac{1}{N} \sum_{i=1}^{N} f(\mathbf{x}_s, \mathbf{x}_{i,-s}) \tag{16.110}$$


Figure 16.20 shows an example where we use sets corresponding to each single variable. The data 
was generated from Equation 16.109, with 5 irrelevant variables added. We then fit a BART model 
(Section 16.2.5) and computed the partial dependence plots. We see that the predicted response 
is invariant for $s \in \{6, \ldots, 10\}$, indicating that these variables are (marginally) irrelevant. The 
response is roughly linear in $x_4$ and $x_5$, and roughly quadratic in $x_3$. (The error bars are obtained 
by computing empirical quantiles of $f(\mathbf{x}_s, \boldsymbol{\theta})$ based on posterior samples of $\boldsymbol{\theta}$; alternatively, we 
can use bootstrap.) 
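
The averaging in Equation 16.110 can be sketched in a few lines. As an assumption made purely for illustration, the noise-free function of Equation 16.109 stands in for the fitted model (no BART model is fit here), the inputs are random draws on [0,1], and the helper names are hypothetical.

```python
import math
import random

def friedman(x):
    """Noise-free version of Equation 16.109; only x[0] to x[4] affect the output."""
    return (10 * math.sin(math.pi * x[0] * x[1]) + 20 * (x[2] - 0.5) ** 2
            + 10 * x[3] + 5 * x[4])

def partial_dependence(f, X, s, grid):
    """For each grid value, clamp feature s to it and average f over the data."""
    out = []
    for v in grid:
        total = 0.0
        for row in X:
            x = list(row)   # copy so the data set is left unchanged
            x[s] = v
            total += f(x)
        out.append(total / len(X))
    return out

random.seed(0)
X = [[random.random() for _ in range(10)] for _ in range(200)]
grid = [0.0, 0.25, 0.5, 0.75, 1.0]

pd_x4 = partial_dependence(friedman, X, 3, grid)  # increases linearly in x4
pd_x8 = partial_dependence(friedman, X, 7, grid)  # flat, since x8 is irrelevant
```

With a real fitted model in place of `friedman`, the same loop applies unchanged; only the cost of each call to the model differs.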

Another very useful summary computes the relative importance of predictor variables. 
This can be thought of as a nonlinear, or even “model free”, way of performing variable selection, 
although the technique is restricted to ensembles of trees. The basic idea, originally proposed 
in (Breiman et al. 1984), is to count how often variable j is used as a node in any of the trees. 
In particular, let $v_j = \frac{1}{M} \sum_{m=1}^{M} \mathbb{I}(j \in T_m)$ be the proportion of all splitting rules that use $x_j$, 
where $T_m$ is the $m$’th tree. If we can sample the posterior of trees, $p(T_{1:M}|\mathcal{D})$, we can easily 
compute the posterior for $v_j$. Alternatively, we can use bootstrap. 

Figure 16.21 gives an example, using BART. We see that the five relevant variables are chosen 
much more than the five irrelevant variables. As we increase the number M of trees, all the 
variables are more likely to be chosen, reducing the sensitivity of this method, but for small M, 
the method is fairly diagnostic. 


Exercises 


Exercise 16.1 Nonlinear regression for inverse dynamics 


In this question, we fit a model which can predict what torques a robot needs to apply in order to make 
its arm reach a desired point in space. The data was collected from a SARCOS robot arm with 7 degrees of 
freedom. The input vector $\mathbf{x} \in \mathbb{R}^{21}$ encodes the desired position, velocity and acceleration of the 7 joints. 
The output vector $\mathbf{y} \in \mathbb{R}^{7}$ encodes the torques that should be applied to the joints to reach that point. 
The mapping from x to y is highly nonlinear. 


We have $N = 48,933$ training points and $N_{test} = 4,449$ testing points. For simplicity, we follow 
standard practice and focus on just predicting a scalar output, namely the torque for the first joint. 


Download the data from http://www.gaussianprocess.org/gpml. Standardize the inputs so they 
have zero mean and unit variance on the training set, and center the outputs so they have zero mean 
on the training set. Apply the corresponding transformations to the test data. Below we will describe 
various models which you should fit to this transformed data. Then make predictions and compute the 
standardized mean squared error on the test set as follows: 


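$$\text{SMSE} = \frac{\frac{1}{N_{test}} \sum_{i=1}^{N_{test}} (y_i - \hat{y}_i)^2}{\sigma^2} \tag{16.111}$$

where $\sigma^2 = \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} (y_i - \bar{y})^2$ is the variance of the output computed on the training set. 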
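

As a quick check on Equation 16.111, the sketch below computes the SMSE for plain Python lists. The function name and the tiny data set are illustrative assumptions, not part of the exercise.

```python
def smse(y_test, y_pred, y_train):
    """Standardized mean squared error: test MSE divided by the training-set variance."""
    mean_train = sum(y_train) / len(y_train)
    var_train = sum((y - mean_train) ** 2 for y in y_train) / len(y_train)
    mse = sum((y - p) ** 2 for y, p in zip(y_test, y_pred)) / len(y_test)
    return mse / var_train

y_train = [1.0, 2.0, 3.0, 4.0]
# Predicting the training mean everywhere gives an SMSE of 1 on the training set.
baseline = smse(y_train, [2.5] * 4, y_train)
# Perfect predictions give an SMSE of 0.
perfect = smse(y_train, y_train, y_train)
```

An SMSE below 1 therefore means the model does better than always predicting the training mean.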


a. The first method you should try is standard linear regression. Turn in your numbers and code. 
(According to (Rasmussen and Williams 2006, p24), you should be able to achieve a SMSE of 0.075 
using this method.) 


b. Now try running K-means clustering (using cross validation to pick $K$). Then fit an RBF network to 
the data, using the $\boldsymbol{\mu}_k$ estimated by K-means. Use CV to estimate the RBF bandwidth. What SMSE do 
you get? Turn in your numbers and code. (According to (Rasmussen and Williams 2006, p24), Gaussian 
process regression can get an SMSE of 0.011, so the goal is to get close to that.) 


c. Now try fitting a feedforward neural network. Use CV to pick the number of hidden units and the 
strength of the $\ell_2$ regularizer. What SMSE do you get? Turn in your numbers and code. 


Markov and hidden Markov models 


Introduction 


In this chapter, we discuss probabilistic models for sequences of observations, $X_1, \ldots, X_T$, of 
arbitrary length $T$. Such models have applications in computational biology, natural language 
processing, time series forecasting, etc. We focus on the case where the observations occur 
at discrete “time steps”, although “time” may also refer to locations within a sequence. 


Markov models 


Recall from Section 10.2.2 that the basic idea behind a Markov chain is to assume that $X_t$ 
captures all the relevant information for predicting the future (i.e., we assume it is a sufficient 
statistic). If we assume discrete time steps, we can write the joint distribution as follows: 


$$p(X_{1:T}) = p(X_1)\, p(X_2|X_1)\, p(X_3|X_2) \cdots = p(X_1) \prod_{t=2}^{T} p(X_t|X_{t-1}) \tag{17.1}$$


This is called a Markov chain or Markov model. 

If we assume the transition function $p(X_t|X_{t-1})$ is independent of time, then the chain is 
called homogeneous, stationary, or time-invariant. This is an example of parameter tying, 
since the same parameter is shared by multiple variables. This assumption allows us to model 
an arbitrary number of variables using a fixed number of parameters; such models are called 
stochastic processes. 

If we assume that the observed variables are discrete, so $X_t \in \{1, \ldots, K\}$, this is called a 
discrete-state or finite-state Markov chain. We will make this assumption throughout the rest of 
this section. 


Transition matrix 


When $X_t$ is discrete, so $X_t \in \{1, \ldots, K\}$, the conditional distribution $p(X_t|X_{t-1})$ can be 
written as a $K \times K$ matrix, known as the transition matrix $\mathbf{A}$, where $A_{ij} = p(X_t = 
j|X_{t-1} = i)$ is the probability of going from state $i$ to state $j$. Each row of the matrix sums to 
one, $\sum_j A_{ij} = 1$, so this is called a stochastic matrix. 


Figure 17.1 State transition diagrams for some simple Markov chains. Left: a 2-state chain. Right: a 
3-state left-to-right chain. 


A stationary, finite-state Markov chain is equivalent to a stochastic automaton. It is common 
to visualize such automata by drawing a directed graph, where nodes represent states and arrows 
represent legal transitions, i.e., non-zero elements of A. This is known as a state transition 
diagram. The weights associated with the arcs are the probabilities. For example, the following 
2-state chain 


$$\mathbf{A} = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix} \tag{17.2}$$


is illustrated in Figure 17.1(left). The following 3-state chain 


$$\mathbf{A} = \begin{pmatrix} A_{11} & A_{12} & 0 \\ 0 & A_{22} & A_{23} \\ 0 & 0 & 1 \end{pmatrix} \tag{17.3}$$


is illustrated in Figure 17.1(right). This is called a left-to-right transition matrix, and is 
commonly used in speech recognition (Section 17.6.2). 

The $A_{ij}$ element of the transition matrix specifies the probability of getting from $i$ to $j$ in 
one step. The $n$-step transition matrix $\mathbf{A}(n)$ is defined as 


$$A_{ij}(n) \triangleq p(X_{t+n} = j | X_t = i) \tag{17.4}$$

which is the probability of getting from $i$ to $j$ in exactly $n$ steps. Obviously $\mathbf{A}(1) = \mathbf{A}$. The 
Chapman-Kolmogorov equations state that 


$$A_{ij}(m+n) = \sum_{k=1}^{K} A_{ik}(m)\, A_{kj}(n) \tag{17.5}$$


In words, the probability of getting from i to j in m + n steps is just the probability of getting 
from i to k in m steps, and then from k to j in n steps, summed up over all k. We can write 
the above as a matrix multiplication 


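$$\mathbf{A}(m+n) = \mathbf{A}(m)\, \mathbf{A}(n) \tag{17.6}$$

Hence 

$$\mathbf{A}(n) = \mathbf{A}\, \mathbf{A}(n-1) = \mathbf{A}\, \mathbf{A}\, \mathbf{A}(n-2) = \cdots = \mathbf{A}^n \tag{17.7}$$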


Thus we can simulate multiple steps of a Markov chain by “powering up” the transition matrix. 


SAYS IT’S NOT IN THE CARDS LEGENDARY RECONNAISSANCE BY ROLLIE 
DEMOCRACIES UNSUSTAINABLE COULD STRIKE REDLINING VISITS TO PROFIT 
BOOKING WAIT HERE AT MADISON SQUARE GARDEN COUNTY COURTHOUSE WHERE HE 
HAD BEEN DONE IN THREE ALREADY IN ANY WAY IN WHICH A TEACHER 


Table 17.1 Example output from a 4-gram word model, trained using backoff smoothing on the Broadcast 
News corpus. The first 4 words are specified by hand, the model generates the 5th word, and then the 
results are fed back into the model. Source: http://www.fit.vutbr.cz/~imikolov/rnnlm/gen-4gram.txt . 
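

Returning to the $n$-step transition matrix of Equations 17.4 to 17.7, the following sketch “powers up” a 2-state chain of the form in Equation 17.2 and checks the Chapman-Kolmogorov identity numerically. The values of alpha and beta and the helper names are assumptions chosen for illustration.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def n_step(A, n):
    """n-step transition matrix A(n) = A^n, built by repeated multiplication."""
    K = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(K)] for i in range(K)]
    for _ in range(n):
        result = matmul(result, A)
    return result

alpha, beta = 0.1, 0.3   # illustrative values only
A = [[1 - alpha, alpha], [beta, 1 - beta]]

A5 = n_step(A, 5)
A2_A3 = matmul(n_step(A, 2), n_step(A, 3))   # should match A(5) by Equation 17.6
```

Each row of `A5` is still a probability distribution, as expected for a product of stochastic matrices.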


Application: Language modeling 


One important application of Markov models is to make statistical language models, which are 
probability distributions over sequences of words. We define the state space to be all the words 
in English (or some other language). The marginal probabilities $p(X_t = k)$ are called unigram 
statistics. If we use a first-order Markov model, then $p(X_t = k|X_{t-1} = j)$ is called a bigram 
model. If we use a second-order Markov model, then $p(X_t = k|X_{t-1} = j, X_{t-2} = i)$ is 
called a trigram model. And so on. In general these are called n-gram models. For example, 
Figure 17.2 shows 1-gram and 2-gram counts for the letters $\{a, \ldots, z, -\}$ (where - represents 
space) estimated from Darwin’s On The Origin Of Species. 
Language models can be used for several things, such as the following: 


• Sentence completion: A language model can predict the next word given the previous 
words in a sentence. This can be used to reduce the amount of typing required, which is 
particularly important for disabled users (see e.g., David MacKay’s Dasher system¹), or users of 
mobile devices. 

• Data compression: Any density model can be used to define an encoding scheme, by 
assigning short codewords to more probable strings. The more accurate the predictive model, 
the fewer bits it requires to store the data. 

• Text classification: Any density model can be used as a class-conditional density and hence 
turned into a (generative) classifier. Note that using a 0-gram class-conditional density (i.e., 
only unigram statistics) would be equivalent to a naive Bayes classifier (see Section 3.5). 

• Automatic essay writing: One can sample from $p(x_{1:t})$ to generate artificial text. This is 
one way of assessing the quality of the model. In Table 17.1, we give an example of text 
generated from a 4-gram model, trained on a corpus with 400 million words. ((Tomas et al. 
2011) describes a much better language model, based on recurrent neural networks, which 
generates much more semantically plausible text.) 


1. http://www.inference.phy.cam.ac.uk/dasher/ 


Figure 17.2 Unigram and bigram counts from Darwin’s On The Origin Of Species. The 2D picture on the 
right is a Hinton diagram of the joint distribution. The size of the white squares is proportional to the 
value of the entry in the corresponding vector/matrix. Based on (MacKay 2003, p22). Figure generated by 
ngramPlot. 


MLE for Markov language models 


We now discuss a simple way to estimate the transition matrix from training data. The proba- 
bility of any particular sequence of length T is given by 


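\begin{align}
p(x_{1:T}|\boldsymbol{\theta}) &= \pi(x_1)\, A(x_1, x_2) \cdots A(x_{T-1}, x_T) \tag{17.8} \\
&= \prod_{j=1}^{K} (\pi_j)^{\mathbb{I}(x_1 = j)} \prod_{t=2}^{T} \prod_{j=1}^{K} \prod_{k=1}^{K} (A_{jk})^{\mathbb{I}(x_t = k,\, x_{t-1} = j)} \tag{17.9}
\end{align}

Hence the log-likelihood of a set of sequences $\mathcal{D} = (\mathbf{x}_1, \ldots, \mathbf{x}_N)$, where $\mathbf{x}_i = (x_{i1}, \ldots, x_{i,T_i})$ 
is a sequence of length $T_i$, is given by 

$$\log p(\mathcal{D}|\boldsymbol{\theta}) = \sum_{i=1}^{N} \log p(\mathbf{x}_i|\boldsymbol{\theta}) = \sum_j N_j^1 \log \pi_j + \sum_j \sum_k N_{jk} \log A_{jk} \tag{17.10}$$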


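where we define the following counts: 


$$N_j^1 \triangleq \sum_{i=1}^{N} \mathbb{I}(x_{i1} = j), \quad N_{jk} \triangleq \sum_{i=1}^{N} \sum_{t=1}^{T_i - 1} \mathbb{I}(x_{i,t} = j,\, x_{i,t+1} = k) \tag{17.11}$$


Hence we can write the MLE as the normalized counts: 


$$\hat{\pi}_j = \frac{N_j^1}{\sum_{j'} N_{j'}^1}, \quad \hat{A}_{jk} = \frac{N_{jk}}{\sum_{k'} N_{jk'}} \tag{17.12}$$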

These results can be extended in a straightforward way to higher order Markov models. 
However, the problem of zero-counts becomes very acute whenever the number of states $K$, 
and/or the order of the chain, $n$, is large. An n-gram model has $O(K^n)$ parameters. If we have 
$K \sim 50,000$ words in our vocabulary, then a bi-gram model will have about 2.5 billion free 
parameters, corresponding to all possible word pairs. It is very unlikely we will see all of these 
in our training data. However, we do not want to predict that a particular word string is totally 
impossible just because we happen not to have seen it in our training text, since that would be a 
severe form of overfitting.² 

A simple solution to this is to use add-one smoothing, where we simply add one to all the 
empirical counts before normalizing. The Bayesian justification for this is given in Section 3.3.4.1. 
However, add-one smoothing assumes all n-grams are equally likely, which is not very realistic. 
A more sophisticated Bayesian approach is discussed in Section 17.2.2.2. 
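
The count-based estimate of the transition matrix and the effect of add-one smoothing can be illustrated with a short sketch. The two toy sequences, the three-letter state space, and the function names are assumptions made for the example; the `pseudo` argument is a hypothetical knob where 0 gives the MLE and 1 gives add-one smoothing.

```python
from collections import Counter

def bigram_counts(sequences):
    """Return start-state counts and transition counts over a set of sequences."""
    start, trans = Counter(), Counter()
    for seq in sequences:
        start[seq[0]] += 1
        for a, b in zip(seq, seq[1:]):
            trans[(a, b)] += 1
    return start, trans

def transition_probs(trans, states, pseudo=0.0):
    """Row-normalized transition counts, with an optional pseudo-count added to every entry."""
    A = {}
    for j in states:
        row_total = sum(trans[(j, k)] for k in states) + pseudo * len(states)
        A[j] = {k: (trans[(j, k)] + pseudo) / row_total if row_total > 0 else 0.0
                for k in states}
    return A

sequences = ["abab", "abba"]   # toy data; state "c" is never observed
states = "abc"
start, trans = bigram_counts(sequences)

mle = transition_probs(trans, states)                 # unseen transitions get probability 0
smooth = transition_probs(trans, states, pseudo=1.0)  # add-one smoothing
```

Under the MLE the transition from "a" to "c" is assigned probability zero, whereas add-one smoothing gives it probability 1/6, and every row of the smoothed matrix sums to one, even for the unseen state "c".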

An alternative to using smart priors is to gather lots and lots of data. For example, Google 
has fit n-gram models (for n = 1 : 5) based on one trillion words extracted from the web. Their 
data, which is over 100GB when uncompressed, is publicly available.³ An example of their 
data, for a set of 4-grams, is shown below. 


serve as the incoming 92 
serve as the incubator 99 
serve as the independent 794 
serve as the index 223 
serve as the indication 72 
serve as the indicator 120 
serve as the indicators 45 
serve as the indispensable 111 
serve as the indispensible 40 
serve as the individual 234 


Although such an approach, based on “brute force and ignorance”, can be successful, it is 
rather unsatisfying, since it is clear that this is not how humans learn (see e.g., (Tenenbaum 
and Xu 2000)). A more refined Bayesian approach, that needs much less data, is described in 
Section 17.2.2.2. 


2. A famous example of an improbable, but syntactically valid, English word string, due to Noam Chomsky, is “colourless 
green ideas sleep furiously”. We would not want our model to predict that this string is impossible. Even ungrammatical 
constructs should be allowed by our model with a certain probability, since people frequently violate grammatical rules, 
especially in spoken language. 

3. See http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html for details. 


Empirical Bayes version of deleted interpolation 


A common heuristic used to fix the sparse data problem is called deleted interpolation (Chen 
and Goodman 1996). This defines the transition matrix as a convex combination of the bigram 
frequencies $f_{jk} = N_{jk}/N_j$ and the unigram frequencies $f_k = N_k/N$: 

$$A_{jk} = (1 - \lambda) f_{jk} + \lambda f_k \tag{17.13}$$


The term $\lambda$ is usually set by cross validation. There is also a closely related technique called 
backoff smoothing; the idea is that if $f_{jk}$ is too small, we “back off” to a more reliable estimate, 
namely $f_k$. 

We will now show that the deleted interpolation heuristic is an approximation to the predic- 
tions made by a simple hierarchical Bayesian model. Our presentation follows (McKay and Peto 
1995). First, let us use an independent Dirichlet prior on each row of the transition matrix: 


$$\mathbf{A}_j \sim \text{Dir}(\alpha_0 m_1, \ldots, \alpha_0 m_K) = \text{Dir}(\alpha_0 \mathbf{m}) = \text{Dir}(\boldsymbol{\alpha}) \tag{17.14}$$


where $\mathbf{A}_j$ is row $j$ of the transition matrix, $\mathbf{m}$ is the prior mean (satisfying $\sum_k m_k = 1$) and 
$\alpha_0$ is the prior strength. We will use the same prior for each row: see Figure 17.3. 

The posterior is given by $\mathbf{A}_j \sim \text{Dir}(\boldsymbol{\alpha} + \mathbf{N}_j)$, where $\mathbf{N}_j = (N_{j1}, \ldots, N_{jK})$ is the vector 
that records the number of times we have transitioned out of state $j$ to each of the other states. 
From Equation 3.51, the posterior predictive density is 

$$p(X_{t+1} = k | X_t = j, \mathcal{D}) = \overline{A}_{jk} = \frac{N_{jk} + \alpha_0 m_k}{N_j + \alpha_0} = \frac{f_{jk} N_j + \alpha_0 m_k}{N_j + \alpha_0} = (1 - \lambda_j) f_{jk} + \lambda_j m_k \tag{17.15}$$


where $\overline{A}_{jk} = \mathbb{E}[A_{jk}|\mathcal{D}, \boldsymbol{\alpha}]$ and 

$$\lambda_j = \frac{\alpha_0}{N_j + \alpha_0} \tag{17.16}$$

This is very similar to Equation 17.13 but not identical. The main difference is that the Bayesian 
model uses a context-dependent weight $\lambda_j$ to combine $m_k$ with the empirical frequency $f_{jk}$, 
rather than a fixed weight $\lambda$. This is like adaptive deleted interpolation. Furthermore, rather 
than backing off to the empirical marginal frequencies $f_k$, we back off to the model parameter 
$m_k$. 
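
The difference between the fixed weight of Equation 17.13 and the context-dependent weight of Equation 17.16 can be seen in a small numerical sketch. The counts, prior mean and prior strength below are made-up values, and the function names are hypothetical.

```python
def deleted_interpolation(f_jk, f_k, lam):
    """Equation 17.13: a fixed-weight mix of bigram and unigram frequencies."""
    return (1 - lam) * f_jk + lam * f_k

def bayes_predictive(N_jk, N_j, m_k, alpha0):
    """Equation 17.15: the weight on the prior mean shrinks as the context count N_j grows."""
    lam_j = alpha0 / (N_j + alpha0)
    f_jk = N_jk / N_j if N_j > 0 else 0.0
    return (1 - lam_j) * f_jk + lam_j * m_k

# Illustrative numbers: word k follows context j in 3 of 10 occurrences.
p_seen = bayes_predictive(N_jk=3, N_j=10, m_k=0.2, alpha0=5.0)
# For a context that was never observed, the prediction falls back to m_k.
p_novel = bayes_predictive(N_jk=0, N_j=0, m_k=0.2, alpha0=5.0)
# The fixed-weight heuristic uses the same lambda whatever the context count.
p_fixed = deleted_interpolation(f_jk=0.3, f_k=0.2, lam=0.5)
```

For frequently seen contexts the Bayesian weight on the prior becomes small, so the estimate approaches the empirical bigram frequency; for rare contexts it stays close to the prior mean.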

The only remaining question is: what values should we use for $\boldsymbol{\alpha}$ and $\mathbf{m}$? Let's use empirical 
Bayes. Since we assume each row of the transition matrix is a priori independent given $\boldsymbol{\alpha}$, the 
marginal likelihood for our Markov model is found by applying Equation 5.24 to each row: 


$$p(\mathcal{D}|\boldsymbol{\alpha}) = \prod_j \frac{B(\mathbf{N}_j + \boldsymbol{\alpha})}{B(\boldsymbol{\alpha})} \tag{17.17}$$


where $\mathbf{N}_j = (N_{j1}, \ldots, N_{jK})$ are the counts for leaving state $j$ and $B(\boldsymbol{\alpha})$ is the generalized 
beta function. 

We can fit this using the methods discussed in (Minka 2000e). However, we can also use the 
following approximation (McKay and Peto 1995, p12): 


$$m_k \propto |\{j : N_{jk} > 0\}| \tag{17.18}$$


This says that the prior probability of word $k$ is given by the number of different contexts in 
which it occurs, rather than the number of times it occurs. To justify the reasonableness of this 
result, MacKay and Peto (McKay and Peto 1995) give the following example. 


Figure 17.3 A Markov chain in which we put a different Dirichlet prior on every row of the transition 
matrix A, but the hyperparameters of the Dirichlet are shared. 


Imagine, you see, that the language, you see, has, you see, a 
frequently occuring couplet ’you see’, you see, in which the second 
word of the couplet, see, follows the first word, you, with very high 
probability, you see. Then the marginal statistics, you see, are going 
to become hugely dominated, you see, by the words you and see, with 
equal frequency, you see. 


If we use the standard smoothing formula, Equation 17.13, then P(you|novel) and P(see|novel), 
for some novel context word not seen before, would turn out to be the same, since the marginal 
frequencies of ’you’ and ’see’ are the same (11 times each). However, this seems unreasonable. 
‘You’ appears in many contexts, so P(you|novel) should be high, but ‘see’ only follows ‘you’, so 
P(see|novel) should be low. If we use the Bayesian formula Equation 17.15, we will get this effect 
for free, since we back off to $m_k$ not $f_k$, and $m_k$ will be large for ’you’ and small for ’see’ by 
Equation 17.18. 
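
The counts behind this argument can be checked directly on the quoted passage. The sketch below assumes a simple tokenizer (lower-casing and keeping runs of letters), which is an illustrative choice rather than anything specified in the text, and it takes the context of a word to be the word immediately before it.

```python
import re
from collections import Counter, defaultdict

text = (
    "Imagine, you see, that the language, you see, has, you see, a "
    "frequently occuring couplet 'you see', you see, in which the second "
    "word of the couplet, see, follows the first word, you, with very high "
    "probability, you see. Then the marginal statistics, you see, are going "
    "to become hugely dominated, you see, by the words you and see, with "
    "equal frequency, you see."
)

tokens = re.findall(r"[a-z]+", text.lower())
freq = Counter(tokens)            # marginal counts, proportional to f_k

contexts = defaultdict(set)       # the set {j : N_jk > 0} for each word k
for prev, word in zip(tokens, tokens[1:]):
    contexts[word].add(prev)
n_contexts = {w: len(c) for w, c in contexts.items()}   # proportional to m_k (Eq. 17.18)
```

Under this tokenization both words occur 11 times, but ‘you’ is preceded by 11 distinct words while ‘see’ is preceded by only 3, so backing off to $m_k$ separates the two words where the marginal frequency cannot.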

Unfortunately, although elegant, this Bayesian model does not beat the state-of-the-art 
language model, known as interpolated Kneser-Ney (Kneser and Ney 1995; Chen and Goodman 
1998). However, in (Teh 2006), it was shown how one can build a non-parametric Bayesian 
model which outperforms interpolated Kneser-Ney, by using variable-length contexts. In (Wood 
et al. 2009), this method was extended to create the “sequence memoizer”, which is currently 
(2010) the best-performing language model.⁴ 


Handling out-of-vocabulary words 


While the above smoothing methods handle the case where the counts are small or even zero, 
none of them deal with the case where the test set may contain a completely novel word. In 
particular, they all assume that the vocabulary (i.e., the state space of $X_t$) is fixed 
and known (typically it is the set of unique words in the training data, or in some dictionary). 


4. Interestingly, these non-parametric methods are based on posterior inference using MCMC (Section 24.1) and/or 
particle filtering (Section 23.5), rather than optimization methods such as EB. Despite this, they are quite efficient. 


Figure 17.4 Some Markov chains. (a) A 3-state aperiodic chain. (b) A reducible 4-state chain. 


Even if all $A_{jk}$’s are non-zero, none of these models will predict a novel word outside of this set, 
and hence will assign zero probability to a test sentence with an unfamiliar word. (Unfamiliar 
words are bound to occur, because the set of words is an open class. For example, the set of 
proper nouns (names of people and places) is unbounded.) 

A standard heuristic to solve this problem is to replace all novel words with the special symbol 
unk, which stands for “unknown”. A certain amount of probability mass is held aside for this 
event. 
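
The unk heuristic amounts to a simple preprocessing step, sketched below. The `min_count` threshold is an assumption added for illustration: mapping rare training words to unk is one common way of reserving probability mass for the unknown symbol, but the text does not prescribe a particular scheme.

```python
from collections import Counter

def build_vocab(train_tokens, min_count=1):
    """Keep the words seen at least min_count times in the training data."""
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_count}

def replace_oov(tokens, vocab, unk="unk"):
    """Map every out-of-vocabulary token to the special unknown symbol."""
    return [t if t in vocab else unk for t in tokens]

train = "the cat sat on the mat".split()
vocab = build_vocab(train)
test = replace_oov("the dog sat".split(), vocab)

# With min_count=2, rare training words also become unk, so the model
# sees the symbol during training and can assign it non-zero probability.
train_unk = replace_oov(train, build_vocab(train, min_count=2))
```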

A more principled solution would be to use a Dirichlet process, which can generate a countably 
infinite state space, as the amount of data increases (see Section 25.2.2). If all novel words are 
“accepted” as genuine words, then the system has no predictive power, since any misspelling 
will be considered a new word. So the novel word has to be seen frequently enough to warrant 
being added to the vocabulary. See e.g., (Friedman and Singer 1999; Griffiths and Tenenbaum 
2001) for details. 
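
The unk heuristic can be sketched as follows. This is a minimal illustration, assuming a unigram model with add-one smoothing and a frequency threshold for admitting words to the vocabulary; the `<unk>` spelling, the threshold `min_count` and all function names are hypothetical choices, not taken from the text.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical spelling of the special "unknown" symbol


def build_vocab(train_tokens, min_count=2):
    """Keep words seen at least min_count times; rarer words will map to UNK."""
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_count}


def map_unk(tokens, vocab):
    """Replace every out-of-vocabulary token with UNK."""
    return [w if w in vocab else UNK for w in tokens]


def unigram_probs(train_tokens, vocab):
    """Add-one smoothed unigram distribution over the vocabulary plus UNK."""
    counts = Counter(map_unk(train_tokens, vocab))
    states = sorted(vocab | {UNK})
    total = sum(counts.values()) + len(states)
    return {w: (counts[w] + 1) / total for w in states}


train = "the cat sat on the mat the cat ran".split()
vocab = build_vocab(train)
probs = unigram_probs(train, vocab)
test = map_unk("the dog sat".split(), vocab)
```

Because rare training words are themselves mapped to unk, the model holds aside probability mass for the "unknown" event, so a test sentence containing a novel word no longer receives zero probability.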


Stationary distribution of a Markov chain * 


We have been focussing on Markov models as a way of defining joint probability distributions 
over sequences. However, we can also interpret them as stochastic dynamical systems, where 
we “hop” from one state to another at each time step. In this case, we are often interested in the 
long term distribution over states, which is known as the stationary distribution of the chain. 
In this section, we discuss some of the relevant theory. Later we will consider two important 
applications: Google’s PageRank algorithm for ranking web pages (Section 17.2.4), and the MCMC 
algorithm for generating samples from hard-to-normalize probability distributions (Chapter 24). 


What is a stationary distribution? 


Let $A_{ij} = p(X_t = j | X_{t-1} = i)$ be the one-step transition matrix, and let $\pi_t(j) = p(X_t = j)$ 
be the probability of being in state $j$ at time $t$. It is conventional in this context to assume that 
$\pi$ is a row vector. If we have an initial distribution over states of $\pi_0$, then at time 1 we have 

$$\pi_1(j) = \sum_i \pi_0(i) A_{ij} \tag{17.19}$$

or, in matrix notation, 

$$\pi_1 = \pi_0 A \tag{17.20}$$

We can imagine iterating these equations. If we ever reach a stage where 

$$\pi = \pi A \tag{17.21}$$


then we say we have reached the stationary distribution (also called the invariant distribution 
or equilibrium distribution). Once we enter the stationary distribution, we will never leave. 
For example, consider the chain in Figure 17.4(a). To find its stationary distribution, we write 


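$$\begin{pmatrix} \pi_1 & \pi_2 & \pi_3 \end{pmatrix} = \begin{pmatrix} \pi_1 & \pi_2 & \pi_3 \end{pmatrix} \begin{pmatrix} 1 - A_{12} - A_{13} & A_{12} & A_{13} \\ A_{21} & 1 - A_{21} - A_{23} & A_{23} \\ A_{31} & A_{32} & 1 - A_{31} - A_{32} \end{pmatrix} \tag{17.22}$$

so 

$$\pi_1 = \pi_1 (1 - A_{12} - A_{13}) + \pi_2 A_{21} + \pi_3 A_{31} \tag{17.23}$$

or 

$$\pi_1 (A_{12} + A_{13}) = \pi_2 A_{21} + \pi_3 A_{31} \tag{17.24}$$

In general, we have 

$$\pi_i \sum_{j \neq i} A_{ij} = \sum_{j \neq i} \pi_j A_{ji} \tag{17.25}$$

In other words, the probability of being in state $i$ times the net flow out of state $i$ must equal 
the probability of being in each other state $j$ times the net flow from that state into $i$. These 
are called the global balance equations. We can then solve these equations, subject to the 
constraint that $\sum_j \pi_j = 1$. 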


Computing the stationary distribution 


To find the stationary distribution, we can just solve the eigenvector equation $A^T v = v$, and 
then set $\pi = v^T$, where $v$ is an eigenvector with eigenvalue 1. (We can be sure such an 
eigenvector exists, since $A$ is a row-stochastic matrix, so $A \mathbf{1} = \mathbf{1}$; also recall that the eigenvalues 
of $A$ and $A^T$ are the same.) Of course, since eigenvectors are unique only up to constants of 
proportionality, we must normalize $v$ at the end to ensure it sums to one. 

Note, however, that the eigenvectors are only guaranteed to be real-valued if the matrix is 
positive, $A_{ij} > 0$ (and hence $A_{ij} < 1$, due to the sum-to-one constraint). A more general 
approach, which can handle chains where some transition probabilities are 0 or 1 (such as 
Figure 17.4(a)), is as follows (Resnick 1992, p138). We have $K$ constraints from $\pi (I - A) = \mathbf{0}_{1 \times K}$ 
and 1 constraint from $\pi \mathbf{1}_{K \times 1} = 1$. Since we only have $K$ unknowns, this is overconstrained. 
So let us replace any column (e.g., the last) of $I - A$ with $\mathbf{1}$, to get a new matrix, call it $M$. 
Next we define $r = [0, 0, \ldots, 1]$, where the 1 in the last position corresponds to the column of 
all 1s in $M$. We then solve $\pi M = r$. For example, for a 3 state chain we have to solve this 
linear system: 

$$\begin{pmatrix} \pi_1 & \pi_2 & \pi_3 \end{pmatrix} \begin{pmatrix} 1 - A_{11} & -A_{12} & 1 \\ -A_{21} & 1 - A_{22} & 1 \\ -A_{31} & -A_{32} & 1 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 \end{pmatrix} \tag{17.26}$$

For the chain in Figure 17.4(a) we find $\pi = [0.4, 0.4, 0.2]$. We can easily verify this is correct, 
since $\pi = \pi A$. See mcStatDist for some Matlab code. 
Unfortunately, not all chains have a stationary distribution, as we explain below. 
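
The linear-system approach above ($\pi M = r$) can be sketched in plain Python as follows. The transition matrix `A` is an assumed 3-state chain, chosen only because it is consistent with the periods and the value of $\pi$ quoted in the text for Figure 17.4(a); the helper names are hypothetical, and this is not the mcStatDist implementation.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def stationary_distribution(A):
    """Solve pi M = r, where M is I - A with its last column replaced by ones."""
    K = len(A)
    M = [[(1.0 if i == j else 0.0) - A[i][j] for j in range(K)] for i in range(K)]
    for i in range(K):
        M[i][K - 1] = 1.0
    r = [0.0] * (K - 1) + [1.0]
    # pi M = r is equivalent to M^T pi^T = r^T, a standard column-vector system.
    Mt = [[M[i][j] for i in range(K)] for j in range(K)]
    return solve_linear(Mt, r)


# Assumed example chain (rows sum to one); not necessarily the one in the figure.
A = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [1.0, 0.0, 0.0]]
pi = stationary_distribution(A)
```

For this assumed chain the solution agrees with $\pi = [0.4, 0.4, 0.2]$ up to floating-point error. The sketch presumes the system has a unique solution; for a reducible chain the matrix $M$ is singular and the elimination would fail.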


When does a stationary distribution exist? * 


Consider the 4-state chain in Figure 17.4(b). If we start in state 4, we will stay there forever, since 
4 is an absorbing state. Thus $\pi = (0, 0, 0, 1)$ is one possible stationary distribution. However, 
if we start in 1 or 2, we will oscillate between those two states forever. So $\pi = (0.5, 0.5, 0, 0)$ 
is another possible stationary distribution. If we start in state 3, we could end up in either of 
the above stationary distributions. 

We see from this example that a necessary condition to have a unique stationary distribution 
is that the state transition diagram be a singly connected component, i.e., we can get from any 
state to any other state. Such chains are called irreducible. 

Now consider the 2-state chain in Figure 17.1(a). This is irreducible provided $\alpha, \beta > 0$. 
Suppose $\alpha = \beta = 0.9$. It is clear by symmetry that this chain will spend 50% of its time in 
each state. Thus $\pi = (0.5, 0.5)$. But now suppose $\alpha = \beta = 1$. In this case, the chain will 
oscillate between the two states, but the long-term distribution on states depends on where you 
start from. If we start in state 1, then on every odd time step (1,3,5,...) we will be in state 1; but 
if we start in state 2, then on every odd time step we will be in state 2. 

This example motivates the following definition. Let us say that a chain has a limiting 
distribution if $\pi_j = \lim_{n \to \infty} A^n_{ij}$ exists and is independent of $i$, for all $j$. If this holds, then 
the long-run distribution over states will be independent of the starting state: 

$$P(X_t = j) = \sum_i P(X_0 = i) A_{ij}(t) \to \pi_j \text{ as } t \to \infty \tag{17.27}$$


Let us now characterize when a limiting distribution exists. Define the period of state $i$ to be 

$$d(i) = \gcd\{t : A_{ii}(t) > 0\} \tag{17.28}$$

where gcd stands for greatest common divisor, i.e., the largest integer that divides all the 
members of the set. For example, in Figure 17.4(a), we have $d(1) = d(2) = \gcd(2, 3, 4, 6, \ldots) = 1$ 
and $d(3) = \gcd(3, 5, 6, \ldots) = 1$. We say a state $i$ is aperiodic if $d(i) = 1$. (A sufficient condition 
to ensure this is if state $i$ has a self-loop, but this is not a necessary condition.) We say a chain 
is aperiodic if all its states are aperiodic. One can show the following important result: 


Theorem 17.2.1. Every irreducible (singly connected), aperiodic finite state Markov chain has a 
limiting distribution, which is equal to $\pi$, its unique stationary distribution. 


A special case of this result says that every regular finite state chain has a unique stationary 
distribution, where a regular chain is one whose transition matrix satisfies $A^n_{ij} > 0$ for some 
integer $n$ and all $i, j$, i.e., it is possible to get from any state to any other state in $n$ steps. 
Consequently, after n steps, the chain could be in any state, no matter where it started. One 
can show that sufficient conditions to ensure regularity are that the chain be irreducible (singly 
connected) and that every state have a self-transition. 
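
The period in Equation 17.28 and the regularity condition can both be checked numerically from the sparsity pattern of $A$. The sketch below is illustrative only: `period` takes the gcd over return times up to a finite `horizon` (an assumption that suffices for small chains), and `is_regular` tests powers up to $(K-1)^2 + 1$, which is Wielandt's bound for primitive matrices. The two example chains are assumed, not taken from the figures.

```python
from math import gcd


def bool_matmul(X, Y):
    """Boolean matrix product: entry (i, j) is True if some k links i -> k -> j."""
    n = len(X)
    return [[any(X[i][k] and Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def period(A, state, horizon=50):
    """gcd of all t <= horizon with A_ii(t) > 0 (0 if the state is never revisited)."""
    reach = [[a > 0 for a in row] for row in A]
    step = reach  # support of A^t, starting at t = 1
    d = 0
    for t in range(1, horizon + 1):
        if step[state][state]:
            d = gcd(d, t)
        step = bool_matmul(step, reach)
    return d


def is_regular(A, max_power=None):
    """True if some power A^n with n <= max_power has all entries positive."""
    K = len(A)
    if max_power is None:
        max_power = (K - 1) ** 2 + 1
    reach = [[a > 0 for a in row] for row in A]
    step = reach
    for _ in range(max_power):
        if all(all(row) for row in step):
            return True
        step = bool_matmul(step, reach)
    return False


cyclic3 = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [1.0, 0.0, 0.0]]  # assumed chain
flip2 = [[0.0, 1.0], [1.0, 0.0]]  # deterministic oscillation between 2 states
periods = [period(cyclic3, s) for s in range(3)]
```

Here `cyclic3` has no self-loops yet every state is aperiodic and the chain is regular, while `flip2` has period 2 and is not regular, matching the oscillating 2-state example in the text.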

To handle the case of Markov chains whose state-space is not finite (e.g., the countable set of 
all integers, or the uncountable set of all reals), we need to generalize some of the earlier 
definitions. Since the details are rather technical, we just briefly state the main results without 
proof. See e.g., (Grimmett and Stirzaker 1992) for details. 

For a stationary distribution to exist, we require irreducibility (singly connected) and aperiod- 
icity, as before. But we also require that each state is recurrent. (A chain in which all states 
are recurrent is called a recurrent chain.) Recurrent means that you will return to that state 
with probability 1. As a simple example of a non-recurrent state (i.e., a transient state), consider 
Figure 17.4(b): state 3 is transient because one immediately leaves it and either spins around 
state 4 forever, or oscillates between states 1 and 2 forever. There is no way to return to state 3. 

It is clear that any finite-state irreducible chain is recurrent, since you can always get back to 
where you started from. But now consider an example with an infinite state space. Suppose we 
perform a random walk on the integers, $\mathcal{X} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$. Let $A_{i,i+1} = p$ be the 
probability of moving right, and $A_{i,i-1} = 1 - p$ be the probability of moving left. Suppose we 
start at $X_1 = 0$. If $p > 0.5$, we will shoot off to $+\infty$; we are not guaranteed to return. Similarly, 
if $p < 0.5$, we will shoot off to $-\infty$. So in both cases, the chain is not recurrent, even though 
it is irreducible. 

It should be intuitively obvious that we require all states to be recurrent for a stationary 
distribution to exist. However, this is not sufficient. To see this, consider the random walk 
on the integers again, and suppose p = 0.5. In this case, we can return to the origin an 
infinite number of times, so the chain is recurrent. However, it takes infinitely long to do 
so. This prohibits it from having a stationary distribution. The intuitive reason is that the 
distribution keeps spreading out over a larger and larger set of the integers, and never converges 
to a stationary distribution. More formally, we define a state to be non-null recurrent if the 
expected time to return to this state is finite. A chain in which all states are non-null is called a 
non-null chain. 

For brevity, we say that a state is ergodic if it is aperiodic, recurrent and non-null, and 
we say a chain is ergodic if all its states are ergodic. 

We can now state our main theorem: 


Theorem 17.2.2. Every irreducible (singly connected), ergodic Markov chain has a limiting 
distribution, which is equal to $\pi$, its unique stationary distribution. 


This generalizes Theorem 17.2.1, since for irreducible finite-state chains, all states are recurrent 
and non-null. 

Detailed balance 


Establishing ergodicity can be difficult. We now give an alternative condition that is easier to 
verify. 

We say that a Markov chain $A$ is time reversible if there exists a distribution $\pi$ such that 

$$\pi_i A_{ij} = \pi_j A_{ji} \tag{17.29}$$

These are called the detailed balance equations. This says that the flow from $i$ to $j$ must 
equal the flow from $j$ to $i$, weighted by the appropriate source probabilities. 
We have the following important result. 


Theorem 17.2.3. If a Markov chain with transition matrix $A$ is regular and satisfies detailed 
balance wrt distribution $\pi$, then $\pi$ is a stationary distribution of the chain. 


Figure 17.5 A very small world wide web. Figure generated by pagerankDemo, written by Tim Davis. 


Proof. To see this, note that 

$$\sum_i \pi_i A_{ij} = \sum_i \pi_j A_{ji} = \pi_j \sum_i A_{ji} = \pi_j \tag{17.30}$$

and hence $\pi = \pi A$. 


Note that this condition is sufficient but not necessary (see Figure 17.4(a) for an example of a 
chain with a stationary distribution which does not satisfy detailed balance). 

In Section 24.1, we will discuss Markov chain Monte Carlo or MCMC methods. These take 
as input a desired distribution $\pi$ and construct a transition matrix (or in general, a transition 
kernel) $A$ which satisfies detailed balance wrt $\pi$. Thus by sampling states from such a chain, 
we will eventually enter the stationary distribution, and will visit states with probabilities given 
by $\pi$. 
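
To make the "sufficient but not necessary" remark concrete, the following sketch checks detailed balance and stationarity numerically. Both chains are assumed examples: `B` is a small birth-death chain that is reversible, and `C` is a cyclic 3-state chain whose stationary distribution does not satisfy detailed balance. The function names are hypothetical.

```python
def satisfies_detailed_balance(pi, A, tol=1e-12):
    """Check pi_i A_ij == pi_j A_ji for every pair of states."""
    K = len(pi)
    return all(abs(pi[i] * A[i][j] - pi[j] * A[j][i]) <= tol
               for i in range(K) for j in range(K))


def is_stationary(pi, A, tol=1e-12):
    """Check pi == pi A, treating pi as a row vector."""
    K = len(pi)
    return all(abs(sum(pi[i] * A[i][j] for i in range(K)) - pi[j]) <= tol
               for j in range(K))


# Assumed reversible chain (moves only between neighbouring states).
B = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi_B = [0.25, 0.5, 0.25]

# Assumed cyclic chain: probability flows 1 -> 2 -> 3 -> 1 with no return flow 1 -> 3.
C = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [1.0, 0.0, 0.0]]
pi_C = [0.4, 0.4, 0.2]

results = {
    "B_balance": satisfies_detailed_balance(pi_B, B),
    "B_stationary": is_stationary(pi_B, B),
    "C_balance": satisfies_detailed_balance(pi_C, C),
    "C_stationary": is_stationary(pi_C, C),
}
```

For `B`, detailed balance holds and stationarity follows, as in the theorem. For `C`, `pi_C` is stationary even though detailed balance fails, since the flow from state 1 to state 2 is not matched by the flow from 2 to 1.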


Application: Google’s PageRank algorithm for web page ranking * 


The results in Section 17.2.3 form the theoretical underpinnings to Google’s PageRank algorithm, 
which is used for information retrieval on the world-wide web. We sketch the basic idea below; 
see (Bryan and Leise 2006) for a more detailed explanation. 

We will treat the web as a giant directed graph, where nodes represent web pages (documents) 
and edges represent hyper-links.⁵ We then perform a process called web crawling. We start at 
a few designated root nodes, such as dmoz.org, the home of the Open Directory Project, and 
then follow the links, storing all the pages that we encounter, until we run out of time. 

Next, all of the words in each web page are entered into a data structure called an inverted 
index. That is, for each word, we store a list of the documents where this word occurs. (In 
practice, we store a list of hash codes representing the URLs.) At test time, when a user enters 


5. In 2008, Google said it had indexed 1 trillion ($10^{12}$) unique URLs. If we assume there are about 10 URLs per page 
(on average), this means there were about 100 billion unique web pages. Estimates for 2010 are about 121 billion unique 
web pages. Source: thenextweb.com/shareables/2011/01/11/infographic-how-big-is-the-internet. 

a query, we can just look up all the documents containing each word, and intersect these 
lists (since queries are defined by a conjunction of search terms). We can get a refined search 
by storing the location of each word in each document. We can then test if the words in a 
document occur in the same order as in the query. 

Let us give an example, from http://en.wikipedia.org/wiki/Inverted_index. We 
have 3 documents, $T_0$ = “it is what it is”, $T_1$ = “what is it” and $T_2$ = “it is a banana”. Then 
we can create the following inverted index, where each pair represents a document and word 
location: 


"a": {(2, 2)} 
"banana": {(2, 3)} 
"is": {(0, 1), (0, 4), (1, 1), (2, 1)} 
"it": {(0, 0), (0, 3), (1, 2), (2, 0)} 
"what": {(0, 2), (1, 0)} 


For example, we see that the word “what” occurs at location 2 (counting from 0) in document 
0, and location 0 in document 1. Suppose we search for “what is it”. If we ignore word order, 
we retrieve the following documents: 

$$\{T_0, T_1\} \cap \{T_0, T_1, T_2\} \cap \{T_0, T_1, T_2\} = \{T_0, T_1\} \tag{17.31}$$

If we require that the word order matches, only document $T_1$ would be returned. More generally, 
we can allow out-of-order matches, but can give “bonus points” to documents whose word order 
matches the query’s word order, or to other features, such as if the words occur in the title of 
a document. We can then return the matching documents in decreasing order of their score/ 
relevance. This is called document ranking. 
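
The inverted index example above can be reproduced with a few lines of Python. This is a minimal sketch: whitespace tokenization, the function names, and the treatment of "word order matches" as an exact consecutive phrase match are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to a list of (document id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.split()):
            index[word].append((doc_id, pos))
    return index


def search(index, query):
    """Ids of documents containing every query word (word order ignored)."""
    doc_sets = [{d for d, _ in index.get(w, [])} for w in query.split()]
    return set.intersection(*doc_sets) if doc_sets else set()


def phrase_search(index, query):
    """Ids of documents in which the query words occur consecutively, in order."""
    words = query.split()
    if not words:
        return set()
    hits = set()
    for doc_id, start in index.get(words[0], []):
        if all((doc_id, start + k) in index.get(w, []) for k, w in enumerate(words)):
            hits.add(doc_id)
    return hits


docs = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(docs)
unordered = search(index, "what is it")
ordered = phrase_search(index, "what is it")
```

Here `unordered` contains documents 0 and 1, as in Equation 17.31, and `ordered` contains only document 1. A real system would store sorted posting lists and merge them, rather than building Python sets for every query.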

So far, we have described the standard process of information retrieval. But the link structure 
of the web provides an additional source of information. The basic idea is that some web pages 
are more authoritative than others, so these should be ranked higher (assuming they match 
the query). A web page is an authority if it is linked to by many other pages. But to protect 
against the effect of so-called link farms, which are dummy pages which just link to a given 
site to boost its apparent relevance, we will weight each incoming link by the source's authority. 
Thus we get the following recursive definition for the authoritativeness of page j, also called its 
PageRank: 


$$\pi_j = \sum_i A_{ij} \pi_i \tag{17.32}$$


where $A_{ij}$ is the probability of following a link from $i$ to $j$. We recognize Equation 17.32 as the 
stationary distribution of a Markov chain. 

In the simplest setting, we define $A_{i,:}$ as a uniform distribution over all states that $i$ is 
connected to. However, to ensure the distribution is unique, we need to make the chain into a 
regular chain. This can be done by allowing each state $i$ to jump to any other state (including 
itself) with some small probability. This effectively makes the transition matrix aperiodic and 
fully connected (although the adjacency matrix $G_{ij}$ of the web itself is highly sparse). 

We discuss efficient methods for computing the leading eigenvector of this giant matrix below. 
But first, let us give an example of the PageRank algorithm. Consider the small web in Figure 17.5. 

Figure 17.6 (a) Web graph of 500 sites rooted at www.harvard.edu. (b) Corresponding page rank vector. 
Figure generated by pagerankDemoPmtk, based on code by Cleve Moler (Moler 2004). 


We find that the stationary distribution is 

$$\pi = (0.3209, 0.1706, 0.1065, 0.1368, 0.0643, 0.2008) \tag{17.33}$$


So a random surfer will visit site 1 about 32% of the time. We see that node 1 has a higher 
PageRank than nodes 4 or 6, even though they all have the same number of in-links. This is 
because being linked to from an influential node helps increase your PageRank score more than 
being linked to by a less influential node. 

As a slightly larger example, Figure 17.6(a) shows a web graph, derived from the root of 
harvard.edu. Figure 17.6(b) shows the corresponding PageRank vector. 


Efficiently computing the PageRank vector 


Let $G_{ij} = 1$ iff there is a link from $j$ to $i$. Now imagine performing a random walk on 
this graph, where at every time step, with probability p = 0.85 you follow one of the outlinks 
uniformly at random, and with probability 1 — p you jump to a random node, again chosen 
uniformly at random. If there are no outlinks, you just jump to a random page. (These random 
jumps, including self-transitions, ensure the chain is irreducible (singly connected) and regular. 
Hence we can solve for its unique stationary distribution using eigenvector methods.) This 
defines the following transition matrix: 


$$M_{ij} = \begin{cases} p G_{ij} / c_j + \delta & \text{if } c_j \neq 0 \\ 1/n & \text{if } c_j = 0 \end{cases} \tag{17.34}$$


where $n$ is the number of nodes, $\delta = (1 - p)/n$ is the probability of jumping from one page 
to another without following a link and $c_j = \sum_i G_{ij}$ represents the out-degree of page $j$. (If 
$n = 4 \cdot 10^9$ and $p = 0.85$, then $\delta = 3.75 \cdot 10^{-11}$.) Here $M$ is a stochastic matrix in which 
columns sum to one. Note that $M = A^T$ in our earlier notation. 

We can represent the transition matrix compactly as follows. Define the diagonal matrix $D$ 
with entries 

$$d_{jj} = \begin{cases} 1/c_j & \text{if } c_j \neq 0 \\ 0 & \text{if } c_j = 0 \end{cases} \tag{17.35}$$

Define the vector $z$ with components 

$$z_j = \begin{cases} \delta & \text{if } c_j \neq 0 \\ 1/n & \text{if } c_j = 0 \end{cases} \tag{17.36}$$

Then we can rewrite Equation 17.34 as follows: 

$$M = p G D + \mathbf{1} z^T \tag{17.37}$$


The matrix M is not sparse, but it is a rank one modification of a sparse matrix. Most of the 
elements of $M$ are equal to the small constant $\delta$. Obviously these do not need to be stored 
explicitly. 

Our goal is to solve $v = M v$, where $v = \pi^T$. One efficient method to find the leading 
eigenvector of a large matrix is known as the power method. This simply consists of repeated 
matrix-vector multiplication, followed by normalization: 

$$v \propto M v = p G D v + \mathbf{1} z^T v \tag{17.38}$$


It is possible to implement the power method without using any matrix multiplications, by 
simply sampling from the transition matrix and counting how often you visit each state. This is 
essentially a Monte Carlo approximation to the sum implied by v = Mv. Applying this to the 
data in Figure 17.6(a) yields the stationary distribution in Figure 17.6(b). This took 13 iterations to 
converge, starting from a uniform distribution. (See also the function pagerankDemo, by Tim 
Davis, for an animation of the algorithm in action, applied to the small web example.) To handle 
changing web structure, we can re-run this algorithm every day or every week, starting v off at 
the old distribution (Langville and Meyer 2006). 

For details on how to perform this Monte Carlo power method in a parallel distributed 
computing environment, see e.g., (Rajaraman and Ullman 2010). 
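
The deterministic power iteration of Equation 17.38 can be sketched without ever forming the dense matrix $M$. The sketch below is not the Monte Carlo variant and not the pagerankDemo code; the adjacency-list representation `out_links` (assumed to contain no duplicate links), the stopping rule and the two small example graphs are assumptions made for illustration.

```python
def pagerank(out_links, p=0.85, tol=1e-10, max_iter=1000):
    """Power method for v = M v with M = p G D + 1 z^T, exploiting sparsity.

    out_links[j] lists the pages that page j links to.
    """
    n = len(out_links)
    delta = (1.0 - p) / n
    v = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # Rank-one part 1 z^T v: the same "random jump" mass goes to every page.
        jump = sum((delta if out_links[j] else 1.0 / n) * v[j] for j in range(n))
        new_v = [jump] * n
        # Sparse part p G D v: page j passes p * v[j] / c_j along each out-link.
        for j, targets in enumerate(out_links):
            for i in targets:
                new_v[i] += p * v[j] / len(targets)
        total = sum(new_v)
        new_v = [x / total for x in new_v]
        if sum(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


# Hypothetical graphs: a 3-cycle, and a 4-page web whose last page has no out-links.
cycle = [[1], [2], [0]]
small_web = [[1, 2], [2], [0], []]
v_cycle = pagerank(cycle)
v_web = pagerank(small_web)
```

By symmetry the 3-cycle yields the uniform vector. Each iteration costs time proportional to the number of links plus the number of pages, which is the reason the rank-one structure of $M$ matters at web scale.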


Web spam 


PageRank is not foolproof. For example, consider the strategy adopted by JC Penney, a depart- 
ment store in the USA. During the Christmas season of 2010, it planted many links to its home 
page on 1000s of irrelevant web pages, thus increasing its ranking on Google's search engine 
(Segal 2011). Even though each of these source pages has low PageRank, there were so many 
of them that their effect added up. Businesses call this search engine optimization; Google 
calls it web spam. When Google was notified of this scam (by the New York Times), it manually 
downweighted JC Penney, since such behavior violates Google’s code of conduct. The result 
was that JC Penney dropped from rank 1 to rank 65, essentially making it disappear from view. 
Automatically detecting such scams relies on various techniques which are beyond the scope of 
this chapter. 


Hidden Markov models 


As we mentioned in Section 10.2.2, a hidden Markov model or HMM consists of a discrete-time, 
discrete-state Markov chain, with hidden states $z_t \in \{1, \ldots, K\}$, plus an observation model 
$p(\mathbf{x}_t | z_t)$. The corresponding joint distribution has the form 

$$p(\mathbf{z}_{1:T}, \mathbf{x}_{1:T}) = p(\mathbf{z}_{1:T}) p(\mathbf{x}_{1:T} | \mathbf{z}_{1:T}) = \left[ p(z_1) \prod_{t=2}^{T} p(z_t | z_{t-1}) \right] \left[ \prod_{t=1}^{T} p(\mathbf{x}_t | z_t) \right] \tag{17.39}$$

Figure 17.7 (a) Some 2d data sampled from a 3 state HMM. Each state emits from a 2d Gaussian. (b) The 
hidden state sequence. Based on Figure 13.8 of (Bishop 2006b). Figure generated by hmmLillypadDemo. 


The observations in an HMM can be discrete or continuous. If they are discrete, it is common 
for the observation model to be an observation matrix: 


$$p(x_t = l | z_t = k, \theta) = B(k, l) \tag{17.40}$$


If the observations are continuous, it is common for the observation model to be a conditional 
Gaussian: 


$$p(\mathbf{x}_t | z_t = k, \theta) = \mathcal{N}(\mathbf{x}_t | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \tag{17.41}$$


Figure 17.7 shows an example where we have 3 states, each of which emits a different Gaussian. 
The resulting model is similar to a Gaussian mixture model, except the cluster membership 
has Markovian dynamics. (Indeed, HMMs are sometimes called Markov switching models 
(Fruhwirth-Schnatter 2007).) We see that we tend to get multiple observations in the same 
location, and then a sudden jump to a new cluster. 


Applications of HMMs 


HMMs can be used as black-box density models on sequences. They have the advantage 
over Markov models in that they can represent long-range dependencies between observations, 
mediated via the latent variables. In particular, note that they do not assume the Markov 
property holds for the observations themselves. Such black-box models are useful for time- 
series prediction (Fraser 2008). They can also be used to define class-conditional densities 
inside a generative classifier. 

Figure 17.8 (a) Some DNA sequences. (b) State transition diagram for a profile HMM. Source: Figure 5.7 
of (Durbin et al. 1998). Used with kind permission of Richard Durbin. 

However, it is more common to imbue the hidden states with some desired meaning, and to 
then try to estimate the hidden states from the observations, i.e., to compute $p(z_t | \mathbf{x}_{1:t})$ if we are 
in an online scenario, or $p(z_t | \mathbf{x}_{1:T})$ if we are in an offline scenario (see Section 17.4.1 for further 
discussion of the differences between these two approaches). Below we give some examples of 
applications which use HMMs in this way: 


• Automatic speech recognition. Here $\mathbf{x}_t$ represents features extracted from the speech 
signal, and $z_t$ represents the word that is being spoken. The transition model $p(z_t | z_{t-1})$ 
represents the language model, and the observation model $p(\mathbf{x}_t | z_t)$ represents the acoustic 
model. See e.g., (Jelinek 1997; Jurafsky and Martin 2008) for details. 

• Activity recognition. Here $\mathbf{x}_t$ represents features extracted from a video frame, and $z_t$ is 
the class of activity the person is engaged in (e.g., running, walking, sitting, etc.). See e.g., 
(Szeliski 2010) for details. 

• Part of speech tagging. Here $x_t$ represents a word, and $z_t$ represents its part of speech 
(noun, verb, adjective, etc.). See Section 19.6.2.1 for more information on POS tagging and 
related tasks. 

• Gene finding. Here $x_t$ represents the DNA nucleotides (A,C,G,T), and $z_t$ represents whether 
we are inside a gene-coding region or not. See e.g., (Schweikert et al. 2009) for details. 

• Protein sequence alignment. Here $x_t$ represents an amino acid, and $z_t$ represents whether 
this matches the latent consensus sequence at this location. This model is called a profile 
HMM and is illustrated in Figure 17.8. The HMM has 3 states, called match, insert and delete. 
If $z_t$ is a match state, then $x_t$ is equal to the $t$'th value of the consensus. If $z_t$ is an insert 
state, then $x_t$ is generated from a uniform distribution that is unrelated to the consensus 
sequence. If $z_t$ is a delete state, then $x_t = -$. In this way, we can generate noisy copies of 
the consensus sequence of different lengths. In Figure 17.8(a), the consensus is “AGC”, and 
we see various versions of this below. A path through the state transition diagram, shown 
in Figure 17.8(b), specifies how to align a sequence to the consensus, e.g., for the gnat, the 
most probable path is $D, D, I, I, I, M$. This means we delete the A and G parts of the 
consensus sequence, we insert 3 A’s, and then we match the final C. We can estimate the 
model parameters by counting the number of such transitions, and the number of emissions 
from each kind of state, as shown in Figure 17.8(c). See Section 17.5 for more information on 
training an HMM, and (Durbin et al. 1998) for details on profile HMMs. 


Note that for some of these tasks, conditional random fields, which are essentially discrimi- 
native versions of HMMs, may be more suitable; see Chapter 19 for details. 


Inference in HMMs 


We now discuss how to infer the hidden state sequence of an HMM, assuming the parameters 
are known. Exactly the same algorithms apply to other chain-structured graphical models, such 
as chain CRFs (see Section 19.6.1). In Chapter 20, we generalize these methods to arbitrary 
graphs. And in Section 17.5.2, we show how we can use the output of inference in the context 
of parameter estimation. 


Types of inference problems for temporal models 


There are several different kinds of inferential tasks for an HMM (and SSM in general). To 
illustrate the differences, we will consider an example called the occasionally dishonest casino, 
from (Durbin et al. 1998). In this model, $x_t \in \{1, 2, \ldots, 6\}$ represents which dice face shows 
up, and $z_t$ represents the identity of the dice that is being used. Most of the time the casino 
uses a fair dice, $z = 1$, but occasionally it switches to a loaded dice, $z = 2$, for a short period. 
If $z = 1$ the observation distribution is a uniform multinoulli over the symbols $\{1, \ldots, 6\}$. If 
$z = 2$, the observation distribution is skewed towards face 6 (see Figure 17.9). If we sample from 
this model, we may observe data such as the following: 


Listing 17.1 Example output of casinoDemo 
Rolls: 664153216162115234653214356634261655234232315142464156663246 
Die: LLLLLLLLLLLLLLFFFFFFLLLLLLLLLLLLLLFFFFFFFFFFFFFFFFFFLLLLLLLL 


Here “rolls” refers to the observed symbol and “die” refers to the hidden state (L is loaded and 
F is fair). Thus we see that the model generates a sequence of symbols, but the statistics of the 
distribution change abruptly every now and then. In a typical application, we just see the rolls 
and want to infer which dice is being used. But there are different kinds of inference, which we 
summarize below. 


Figure 17.9 An HMM for the occasionally dishonest casino. The blue arrows visualize the state transition 
diagram A. Based on (Durbin et al. 1998, p54). 


Figure 17.10 Inference in the dishonest casino. Vertical gray bars denote the samples that we generated 
using a loaded die. (a) Filtered estimate of probability of using a loaded dice. (b) Smoothed estimates. (c) 
MAP trajectory. Figure generated by casinoDemo. 


• Filtering means to compute the belief state $p(z_t | \mathbf{x}_{1:t})$ online, or recursively, as the data 
streams in. This is called “filtering” because it reduces the noise more than simply estimating 
the hidden state using just the current estimate, $p(z_t | \mathbf{x}_t)$. We will see below that we can 
perform filtering by simply applying Bayes rule in a sequential fashion. See Figure 17.10(a) for 
an example. 

• Smoothing means to compute $p(z_t | \mathbf{x}_{1:T})$ offline, given all the evidence. See Figure 17.10(b) 
for an example. By conditioning on past and future data, our uncertainty will be significantly 
reduced. To understand this intuitively, consider a detective trying to figure out who committed 
a crime. As he moves through the crime scene, his uncertainty is high until he finds 
the key clue; then he has an “aha” moment, his uncertainty is reduced, and all the previously 
confusing observations are, in hindsight, easy to explain. 


Figure 17.11 The main kinds of inference for state-space models. The shaded region is the interval for 
which we have data. The arrow represents the time step at which we want to perform inference. $t$ is the 
current time, $T$ is the sequence length, $\ell$ is the lag and $h$ is the prediction horizon. See text for details. 
(Panels: filtering; prediction; fixed-lag smoothing; smoothing (offline).) 


• Fixed lag smoothing is an interesting compromise between online and offline estimation; it 
involves computing $p(z_{t-\ell} | \mathbf{x}_{1:t})$, where $\ell > 0$ is called the lag. This gives better performance 
than filtering, but incurs a slight delay. By changing the size of the lag, one can trade off 
accuracy vs delay. 

• Prediction Instead of predicting the past given the future, as in fixed lag smoothing, we 
might want to predict the future given the past, i.e., to compute $p(z_{t+h} | \mathbf{x}_{1:t})$, where $h > 0$ 
is called the prediction horizon. For example, suppose $h = 2$; then we have 

$$p(z_{t+2} | \mathbf{x}_{1:t}) = \sum_{z_{t+1}} \sum_{z_t} p(z_{t+2} | z_{t+1}) p(z_{t+1} | z_t) p(z_t | \mathbf{x}_{1:t}) \tag{17.42}$$

It is straightforward to perform this computation: we just power up the transition matrix and 
apply it to the current belief state. The quantity $p(z_{t+h} | \mathbf{x}_{1:t})$ is a prediction about future 
hidden states; it can be converted into a prediction about future observations using 

$$p(\mathbf{x}_{t+h} | \mathbf{x}_{1:t}) = \sum_{z_{t+h}} p(\mathbf{x}_{t+h} | z_{t+h}) p(z_{t+h} | \mathbf{x}_{1:t}) \tag{17.43}$$

This is the posterior predictive density, and can be used for time-series forecasting (see 
(Fraser 2008) for details). See Figure 17.11 for a sketch of the relationship between filtering, 
smoothing, and prediction. 

• MAP estimation This means computing $\arg\max_{\mathbf{z}_{1:T}} p(\mathbf{z}_{1:T} | \mathbf{x}_{1:T})$, which is a most probable 
state sequence. In the context of HMMs, this is known as Viterbi decoding (see 
Section 17.4.4). Figure 17.10 illustrates the difference between filtering, smoothing and MAP 
decoding for the occasionally dishonest casino HMM. We see that the smoothed (offline) 
estimate is indeed smoother than the filtered (online) estimate. If we threshold the estimates 
at 0.5 and compare to the true sequence, we find that the filtered method makes 71 errors 
out of 300, and the smoothed method makes 49/300; the MAP path makes 60/300 errors. It is 
not surprising that smoothing makes fewer errors than Viterbi, since the optimal way to minimize 
bit-error rate is to threshold the posterior marginals (see Section 5.7.1.1). Nevertheless, 
for some applications, we may prefer the Viterbi decoding, as we discuss in Section 17.4.4. 

• Posterior samples If there is more than one plausible interpretation of the data, it can be 
useful to sample from the posterior, $\mathbf{z}_{1:T} \sim p(\mathbf{z}_{1:T} | \mathbf{x}_{1:T})$. These sample paths contain much 
more information than the sequence of marginals computed by smoothing. 

• Probability of the evidence We can compute the probability of the evidence, $p(\mathbf{x}_{1:T})$, 
by summing up over all hidden paths, $p(\mathbf{x}_{1:T}) = \sum_{\mathbf{z}_{1:T}} p(\mathbf{z}_{1:T}, \mathbf{x}_{1:T})$. This can be used to 
classify sequences (e.g., if the HMM is used as a class conditional density), for model-based 
clustering, for anomaly detection, etc. 
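
The prediction computation in Equations 17.42 and 17.43 amounts to repeatedly multiplying the belief state by the transition matrix and then pushing the result through the observation matrix. The sketch below is illustrative: the casino parameters `A` and `B` and the starting belief are assumed values (state 0 is the fair die, state 1 the loaded die), and the function names are hypothetical.

```python
def predict_states(belief, A, h):
    """h-step-ahead state distribution p(z_{t+h} | x_{1:t}) = belief A^h (row vector)."""
    K = len(belief)
    for _ in range(h):
        belief = [sum(belief[i] * A[i][j] for i in range(K)) for j in range(K)]
    return belief


def predict_observation(state_pred, B):
    """Posterior predictive p(x_{t+h} = l | x_{1:t}) = sum_k p(z_{t+h} = k | x_{1:t}) B[k][l]."""
    num_symbols = len(B[0])
    return [sum(state_pred[k] * B[k][l] for k in range(len(B)))
            for l in range(num_symbols)]


# Assumed parameters for the occasionally dishonest casino.
A = [[0.95, 0.05],
     [0.10, 0.90]]
B = [[1 / 6] * 6,          # fair die: uniform over faces 1..6
     [0.1] * 5 + [0.5]]    # loaded die: face 6 has probability 0.5
belief = [0.2, 0.8]        # hypothetical filtered belief p(z_t | x_{1:t})

z_pred = predict_states(belief, A, h=2)
x_pred = predict_observation(z_pred, B)
```

As the horizon $h$ grows, `z_pred` approaches the stationary distribution of the chain, so long-range predictions carry progressively less information about the current belief state.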


The forwards algorithm 


We now describe how to recursively compute the filtered marginals, $p(z_t | \mathbf{x}_{1:t})$ in an HMM. 
The algorithm has two steps. First comes the prediction step, in which we compute the 
one-step-ahead predictive density; this acts as the new prior for time $t$: 

$$p(z_t = j | \mathbf{x}_{1:t-1}) = \sum_i p(z_t = j | z_{t-1} = i) p(z_{t-1} = i | \mathbf{x}_{1:t-1}) \tag{17.44}$$


Next comes the update step, in which we absorb the observed data from time t using Bayes 
rule: 


$$\alpha_t(j) \triangleq p(z_t = j | x_{1:t}) = p(z_t = j | x_t, x_{1:t-1}) \tag{17.45}$$
$$= \frac{1}{Z_t} \, p(x_t | z_t = j) \, p(z_t = j | x_{1:t-1}) \tag{17.46}$$


where the normalization constant is given by 


$$Z_t \triangleq p(x_t | x_{1:t-1}) = \sum_j p(x_t | z_t = j) \, p(z_t = j | x_{1:t-1}) \tag{17.47}$$


This process is known as the predict-update cycle. The distribution $p(z_t | x_{1:t})$ is called the
(filtered) belief state at time t, and is a vector of K numbers, often denoted by $\alpha_t$. In
matrix-vector notation, we can write the update in the following simple form:


$$\alpha_t \propto \psi_t \odot (\Psi^T \alpha_{t-1}) \tag{17.48}$$


where $\psi_t(j) = p(x_t | z_t = j)$ is the local evidence at time t, $\Psi(i, j) = p(z_t = j | z_{t-1} = i)$ is
the transition matrix, and $u \odot v$ is the Hadamard product, representing elementwise vector
multiplication. See Algorithm 17.1 for the pseudo-code, and hmmFilter for some Matlab code.

In addition to computing the hidden states, we can use this algorithm to compute the log
probability of the evidence:


$$\log p(x_{1:T} | \theta) = \sum_{t=1}^{T} \log p(x_t | x_{1:t-1}) = \sum_{t=1}^{T} \log Z_t \tag{17.49}$$


(We need to work in the log domain to avoid numerical underflow.) 


Algorithm 17.1: Forwards algorithm


Input: Transition matrix $\Psi(i, j) = p(z_t = j | z_{t-1} = i)$, local evidence vectors
    $\psi_t(j) = p(x_t | z_t = j)$, initial state distribution $\pi(j) = p(z_1 = j)$;
$[\alpha_1, Z_1] = \text{normalize}(\psi_1 \odot \pi)$;
for t = 2 : T do
    $[\alpha_t, Z_t] = \text{normalize}(\psi_t \odot (\Psi^T \alpha_{t-1}))$;
Return $\alpha_{1:T}$ and $\log p(x_{1:T}) = \sum_t \log Z_t$;
Subroutine: $[v, Z] = \text{normalize}(u)$: $Z = \sum_j u_j$; $v_j = u_j / Z$;
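
As an illustration, the pseudo-code above can be transcribed almost line for line into Python. The sketch below is not the hmmFilter code referred to in the text; the function names, the list-of-lists data layout and the toy 2-state model are assumptions made here for illustration only.

```python
import math

def normalize(u):
    """Return (v, Z) with Z = sum(u) and v = u / Z."""
    Z = sum(u)
    return [x / Z for x in u], Z

def hmm_filter(pi, A, evidence):
    """Forwards algorithm (hypothetical helper, standard library only).

    pi[j]          = p(z_1 = j)
    A[i][j]        = p(z_t = j | z_{t-1} = i)
    evidence[t][j] = p(x_t | z_t = j), for t = 0..T-1
    Returns the filtered marginals alphas[t][j] and log p(x_{1:T}).
    """
    K = len(pi)
    alpha, Z = normalize([evidence[0][j] * pi[j] for j in range(K)])
    alphas, loglik = [alpha], math.log(Z)
    for psi_t in evidence[1:]:
        # Prediction step: one-step-ahead predictive distribution.
        pred = [sum(A[i][j] * alpha[i] for i in range(K)) for j in range(K)]
        # Update step: multiply by the local evidence and renormalize.
        alpha, Z = normalize([psi_t[j] * pred[j] for j in range(K)])
        alphas.append(alpha)
        loglik += math.log(Z)  # accumulate log Z_t to avoid underflow
    return alphas, loglik

# Toy 2-state model; all numbers are made up for illustration.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
evidence = [[0.7, 0.1], [0.6, 0.3], [0.1, 0.8]]
alphas, loglik = hmm_filter(pi, A, evidence)
```

Each `alphas[t]` sums to one, and `loglik` is the sum of the log normalization constants, as in Equation 17.49.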


The forwards-backwards algorithm 


In Section 17.4.2, we explained how to compute the filtered marginals $p(z_t = j | x_{1:t})$ using
online inference. We now discuss how to compute the smoothed marginals, $p(z_t = j | x_{1:T})$,
using offline inference.


Basic idea 


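The key decomposition relies on the fact that we can break the chain into two parts, the past
and the future, by conditioning on $z_t$:


$$p(z_t = j | x_{1:T}) \propto p(z_t = j, x_{t+1:T} | x_{1:t}) \propto p(z_t = j | x_{1:t}) \, p(x_{t+1:T} | z_t = j, x_{1:t}) \tag{17.50}$$

Because the future observations are independent of the past ones given $z_t$, the last factor
equals $p(x_{t+1:T} | z_t = j)$. Let $\alpha_t(j) \triangleq p(z_t = j | x_{1:t})$ be the filtered belief state as before. Also, define

$$\beta_t(j) \triangleq p(x_{t+1:T} | z_t = j) \tag{17.51}$$


as the conditional likelihood of future evidence given that the hidden state at time t is j.
(Note that this is not a probability distribution over states, since it does not need to satisfy
$\sum_j \beta_t(j) = 1$.) Finally, define

$$\gamma_t(j) \triangleq p(z_t = j | x_{1:T}) \tag{17.52}$$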


as the desired smoothed posterior marginal. From Equation 17.50, we have


$$\gamma_t(j) \propto \alpha_t(j) \beta_t(j) \tag{17.53}$$

We have already described how to recursively compute the $\alpha$'s in a left-to-right fashion in
Section 17.4.2. We now describe how to recursively compute the $\beta$'s in a right-to-left fashion. If
we have already computed $\beta_t$, we can compute $\beta_{t-1}$ as follows:

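$$\beta_{t-1}(i) = p(x_{t:T} | z_{t-1} = i) \tag{17.54}$$
$$= \sum_j p(z_t = j, x_t, x_{t+1:T} | z_{t-1} = i) \tag{17.55}$$
$$= \sum_j p(x_{t+1:T} | z_t = j, z_{t-1} = i, x_t) \, p(z_t = j, x_t | z_{t-1} = i) \tag{17.56}$$
$$= \sum_j p(x_{t+1:T} | z_t = j) \, p(x_t | z_t = j) \, p(z_t = j | z_{t-1} = i) \tag{17.57}$$
$$= \sum_j \beta_t(j) \, \psi_t(j) \, \Psi(i, j) \tag{17.58}$$

We can write the resulting equation in matrix-vector form as

$$\beta_{t-1} = \Psi (\psi_t \odot \beta_t) \tag{17.59}$$

The base case is

$$\beta_T(i) = p(x_{T+1:T} | z_T = i) = p(\emptyset | z_T = i) = 1 \tag{17.60}$$


which is the probability of a non-event.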

Having computed the forwards and backwards messages, we can combine them to compute
$\gamma_t(j) \propto \alpha_t(j) \beta_t(j)$. The overall algorithm is known as the forwards-backwards algorithm.
The pseudo code is very similar to the forwards case; see hmmFwdBack for an implementation.

We can think of this algorithm as passing “messages” from left to right, and then from right 
to left, and then combining them at each node. We will generalize this intuition in Section 20.2, 
when we discuss belief propagation. 
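
The two passes can be sketched in a few lines of Python. This is a minimal illustration rather than the hmmFwdBack code mentioned above; the function name, the data layout and the toy model are assumptions. The backwards messages are rescaled at each step, which is harmless because the smoothed marginals are renormalized at the end.

```python
def forwards_backwards(pi, A, evidence):
    """Return (alpha, beta, gamma) for a discrete HMM (hypothetical helper).

    pi[j] = p(z_1 = j), A[i][j] = p(z_t = j | z_{t-1} = i),
    evidence[t][j] = p(x_t | z_t = j).
    gamma[t][j] is the smoothed marginal p(z_t = j | x_{1:T}).
    """
    K, T = len(pi), len(evidence)

    def normalize(u):
        Z = sum(u)
        return [x / Z for x in u]

    # Forwards pass: alpha[t] is the filtered belief state.
    alpha = [normalize([evidence[0][j] * pi[j] for j in range(K)])]
    for t in range(1, T):
        pred = [sum(A[i][j] * alpha[-1][i] for i in range(K)) for j in range(K)]
        alpha.append(normalize([evidence[t][j] * pred[j] for j in range(K)]))

    # Backwards pass: beta_T = 1, then beta_{t-1} = Psi (psi_t * beta_t).
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 1, 0, -1):
        msg = [evidence[t][j] * beta[t][j] for j in range(K)]
        beta[t - 1] = normalize(
            [sum(A[i][j] * msg[j] for j in range(K)) for i in range(K)]
        )

    # Combine the two messages at each node.
    gamma = [normalize([alpha[t][j] * beta[t][j] for j in range(K)]) for t in range(T)]
    return alpha, beta, gamma

# Toy 2-state model; all numbers are made up for illustration.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
evidence = [[0.7, 0.1], [0.6, 0.3], [0.1, 0.8]]
alpha, beta, gamma = forwards_backwards(pi, A, evidence)
```

At the final time step the smoothed and filtered marginals coincide, since there is no future evidence left to absorb.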


Two-slice smoothed marginals 


When we estimate the parameters of the transition matrix using EM (see Section 17.5), we will
need to compute the expected number of transitions from state i to state j:


$$N_{ij} = \sum_{t=1}^{T-1} E\left[ I(z_t = i, z_{t+1} = j) | x_{1:T} \right] = \sum_{t=1}^{T-1} p(z_t = i, z_{t+1} = j | x_{1:T}) \tag{17.61}$$


The term $p(z_t = i, z_{t+1} = j | x_{1:T})$ is called a (smoothed) two-slice marginal, and can be
computed as follows:


$$\xi_{t,t+1}(i, j) \triangleq p(z_t = i, z_{t+1} = j | x_{1:T}) \tag{17.62}$$
$$\propto p(z_t | x_{1:t}) \, p(z_{t+1} | z_t, x_{t+1:T}) \tag{17.63}$$
$$\propto p(z_t | x_{1:t}) \, p(x_{t+1:T} | z_t, z_{t+1}) \, p(z_{t+1} | z_t) \tag{17.64}$$
$$\propto p(z_t | x_{1:t}) \, p(x_{t+1} | z_{t+1}) \, p(x_{t+2:T} | z_{t+1}) \, p(z_{t+1} | z_t) \tag{17.65}$$
$$= \alpha_t(i) \, \psi_{t+1}(j) \, \beta_{t+1}(j) \, \Psi(i, j) \tag{17.66}$$

In matrix-vector form, we have


$$\xi_{t,t+1} \propto \Psi \odot \left( \alpha_t (\psi_{t+1} \odot \beta_{t+1})^T \right) \tag{17.67}$$


For another interpretation of these equations, see Section 20.2.4.3. 
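
A short Python sketch of Equation 17.66 follows. It assumes the forwards and backwards messages have already been computed; the function name and argument layout are hypothetical. The toy check uses a chain of length T = 2, so that the backwards message at the last step is the all-ones vector.

```python
def two_slice_marginal(alpha_t, psi_next, beta_next, A):
    """xi[i][j] = p(z_t = i, z_{t+1} = j | x_{1:T}), normalized over (i, j).

    alpha_t[i]   : filtered belief state at time t
    psi_next[j]  : local evidence p(x_{t+1} | z_{t+1} = j)
    beta_next[j] : backwards message at time t+1 (any positive rescaling is fine)
    A[i][j]      : transition matrix p(z_{t+1} = j | z_t = i)
    """
    K = len(alpha_t)
    xi = [
        [alpha_t[i] * A[i][j] * psi_next[j] * beta_next[j] for j in range(K)]
        for i in range(K)
    ]
    Z = sum(sum(row) for row in xi)
    return [[v / Z for v in row] for row in xi]

# Toy check with T = 2 (made-up numbers).
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
psi1, psi2 = [0.7, 0.1], [0.6, 0.3]
Z1 = sum(p * e for p, e in zip(pi, psi1))
alpha1 = [p * e / Z1 for p, e in zip(pi, psi1)]
xi12 = two_slice_marginal(alpha1, psi2, [1.0, 1.0], A)
```

Summing `xi12` over all time steps and sequences gives the expected transition counts needed by EM.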


Time and space complexity 


It is clear that a straightforward implementation of FB takes $O(K^2 T)$ time, since we must
perform a K × K matrix-vector multiplication at each step. For some applications, such as speech
recognition, K is very large, so the $O(K^2)$ term becomes prohibitive. Fortunately, if the
transition matrix is sparse, we can reduce this substantially. For example, in a left-to-right
transition matrix, the algorithm takes O(TK) time.

In some cases, we can exploit special properties of the state space, even if the transition
matrix is not sparse. In particular, suppose the states represent a discretization of an underlying
continuous state-space, and the transition matrix has the form $\Psi(i, j) \propto \exp(-\sigma^2 |z_i - z_j|)$,
where $z_i$ is the continuous vector represented by state i. Then one can implement the
forwards-backwards algorithm in O(TK log K) time. This is very useful for models with large state
spaces. See Section 22.2.6.1 for details.

In some cases, the bottleneck is memory, not time. The expected sufficient statistics needed
by EM are $\sum_t \xi_{t-1,t}(i, j)$; this takes constant space (independent of T); however, to compute
them, we need O(KT) working space, since we must store $\alpha_t$ for t = 1, ..., T until we do the
backwards pass. It is possible to devise a simple divide-and-conquer algorithm that reduces the
space complexity from O(KT) to O(K log T) at the cost of increasing the running time from
$O(K^2 T)$ to $O(K^2 T \log T)$: see (Binder et al. 1997; Zweig and Padmanabhan 2000) for details.


The Viterbi algorithm 


The Viterbi algorithm (Viterbi 1967) can be used to compute the most probable sequence of 
states in a chain-structured graphical model, i.e., it can compute 


$$z^* = \arg\max_{z_{1:T}} p(z_{1:T} | x_{1:T}) \tag{17.68}$$


This is equivalent to computing a shortest path through the trellis diagram in Figure 17.12,
where the nodes are possible states at each time step, and the node and edge weights are log
probabilities. That is, the weight of a path $z_1, z_2, \ldots, z_T$ is given by


$$\log \pi_1(z_1) + \log \phi_1(z_1) + \sum_{t=2}^{T} \left[ \log \Psi(z_{t-1}, z_t) + \log \phi_t(z_t) \right] \tag{17.69}$$

where $\phi_t(j) = p(x_t | z_t = j)$ denotes the local evidence at time t.


MAP vs MPE 


Before discussing how the algorithm works, let us make one important remark: the (jointly) most 
probable sequence of states is not necessarily the same as the sequence of (marginally) most probable 
states. The former is given by Equation 17.68, and is what Viterbi computes, whereas the latter is 
given by the maximizer of the posterior marginals or MPM: 


$$\hat{z} = \left( \arg\max_{z_1} p(z_1 | x_{1:T}), \ldots, \arg\max_{z_T} p(z_T | x_{1:T}) \right) \tag{17.70}$$


Figure 17.12 The trellis of states vs time for a Markov chain. Based on (Rabiner 1989).


As a simple example of the difference, consider a chain with two time steps, defining the 
following joint: 


         X1=0   X1=1
X2=0  |  0.04   0.3   |  0.34
X2=1  |  0.36   0.3   |  0.66
      |  0.4    0.6   |


The joint MAP estimate is (0,1), whereas the sequence of marginal MPMs is (1, 1). 
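
The difference can be checked directly from the joint table above. The following Python sketch (variable names are illustrative) finds the jointly most probable pair and the pair of marginally most probable values.

```python
# Joint distribution from the table, keyed by (x1, x2).
joint = {(0, 0): 0.04, (1, 0): 0.3, (0, 1): 0.36, (1, 1): 0.3}

# Joint MAP: the single most probable configuration.
joint_map = max(joint, key=joint.get)

# Marginals of each variable, obtained by summing out the other one.
p_x1 = [sum(p for (x1, _), p in joint.items() if x1 == v) for v in (0, 1)]
p_x2 = [sum(p for (_, x2), p in joint.items() if x2 == v) for v in (0, 1)]

# MPM: maximize each marginal separately.
mpm = (p_x1.index(max(p_x1)), p_x2.index(max(p_x2)))

print(joint_map, mpm)  # (0, 1) (1, 1)
```

Note that the MPM configuration (1, 1) has joint probability 0.3, which is lower than the 0.36 of the joint MAP configuration (0, 1).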

The advantage of the joint MAP estimate is that it is always globally consistent. For example,
suppose we are performing speech recognition and someone says “recognize speech”. This
could be mis-heard as “wreck a nice beach”. Locally it may appear that “beach” is the most
probable interpretation of that particular window of sound, but when we add the requirement
that the data be explained by a single linguistically plausible path, this interpretation becomes
less likely.

On the other hand, the MPM estimates can be more robust (Marroquin et al. 1987). To see
why, note that in Viterbi, when we estimate $z_t$, we “max out” the other variables:


$$z_t^* = \arg\max_{z_t} \max_{z_{1:t-1}, z_{t+1:T}} p(z_{1:t-1}, z_t, z_{t+1:T} | x_{1:T}) \tag{17.71}$$


whereas when we use forwards-backwards, we sum out the other variables:

$$p(z_t | x_{1:T}) = \sum_{z_{1:t-1}, z_{t+1:T}} p(z_{1:t-1}, z_t, z_{t+1:T} | x_{1:T}) \tag{17.72}$$

This makes the MPM in Equation 17.70 more robust, since we estimate each node averaging over
its neighbors, rather than conditioning on a specific value of its neighbors.⁶


6. In general, we may want to mix max and sum. For example, consider a joint distribution where we observe
v and we want to query q; let n be the remaining nuisance variables. We define the MAP estimate as
$x_q^* = \arg\max_{x_q} \sum_{x_n} p(x_q, x_n | x_v)$, where we max over $x_q$ and sum over $x_n$. By contrast, we define the MPE or
most probable explanation as $(x_q^*, x_n^*) = \arg\max_{x_q, x_n} p(x_q, x_n | x_v)$, where we max over both $x_q$ and $x_n$. This
terminology is due to (Pearl 1988), although it is not widely used outside the Bayes net literature. Obviously MAP=MPE if
$n = \emptyset$. However, if $n \neq \emptyset$, then summing out the nuisance variables can give different results than maxing them out.
Summing out nuisance variables is more sensible, but computationally harder, because of the need to combine max and
sum operations (Lerner and Parr 2001).


Details of the algorithm


It is tempting to think that we can implement Viterbi by just replacing the sum-operator in
forwards-backwards with a max-operator. The former is called the sum-product, and the
latter the max-product algorithm. If there is a unique mode, running max-product and then
computing using Equation 17.70 will give the same result as using Equation 17.68 (Weiss and
Freeman 2001b), but in general, it can lead to incorrect results if there are multiple equally
probable joint assignments. The reason is that each node breaks ties independently and hence
may do so in a manner that is inconsistent with its neighbors. The Viterbi algorithm is therefore
not quite as simple as replacing sum with max. In particular, the forwards pass does use
max-product, but the backwards pass uses a traceback procedure to recover the most probable path
through the trellis of states. Essentially, once $z_t$ picks its most probable state, the previous nodes
condition on this event, and therefore they will break ties consistently.

In more detail, define

$$\delta_t(j) \triangleq \max_{z_1, \ldots, z_{t-1}} p(z_{1:t-1}, z_t = j | x_{1:t}) \tag{17.73}$$

This is the probability of ending up in state j at time t, given that we take the most probable
path. The key insight is that the most probable path to state j at time t must consist of the
most probable path to some other state i at time t − 1, followed by a transition from i to j.
Hence


$$\delta_t(j) = \max_i \delta_{t-1}(i) \, \Psi(i, j) \, \phi_t(j) \tag{17.74}$$

where $\phi_t(j) = p(x_t | z_t = j)$ is the local evidence. We also keep track of the most likely
previous state, for each possible state that we end up in:


$$a_t(j) = \arg\max_i \delta_{t-1}(i) \, \Psi(i, j) \, \phi_t(j) \tag{17.75}$$


That is, $a_t(j)$ tells us the most likely previous state on the most probable path to $z_t = j$. We
initialize by setting


$$\delta_1(j) = \pi_j \phi_1(j) \tag{17.76}$$

and we terminate by computing the most probable final state $z_T^*$:


$$z_T^* = \arg\max_i \delta_T(i) \tag{17.77}$$


We can then compute the most probable sequence of states using traceback:

$$z_t^* = a_{t+1}(z_{t+1}^*) \tag{17.78}$$


Figure 17.13 Illustration of Viterbi decoding in a simple HMM for speech recognition. (a) A 3-state HMM
for a single phone. We are visualizing the state transition diagram. We assume the observations have been
vector quantized into 7 possible symbols, $C_1, \ldots, C_7$. Each state $z_1, z_2, z_3$ has a different distribution over
these symbols. Based on Figure 15.20 of (Russell and Norvig 2002). (b) Illustration of the Viterbi algorithm
applied to this model, with data sequence $C_1, C_3, C_4, C_6$. The columns represent time, and the rows
represent states. An arrow from state i at t − 1 to state j at t is annotated with two numbers: the first
is the probability of the i → j transition, and the second is the probability of generating observation $x_t$
from state j. The bold lines/circles represent the most probable sequence of states. Based on Figure 24.27
of (Russell and Norvig 1995).


As usual, we have to worry about numerical underflow. We are free to normalize the $\delta_t$ terms
at each step; this will not affect the maximum. However, unlike the forwards-backwards case,
we can also easily work in the log domain. The key difference is that log max = max log,
whereas $\log \sum \neq \sum \log$. Hence we can use

$$\log \delta_t(j) \triangleq \max_{z_{1:t-1}} \log p(z_{1:t-1}, z_t = j | x_{1:t}) \tag{17.79}$$
$$= \max_i \left[ \log \delta_{t-1}(i) + \log \Psi(i, j) + \log \phi_t(j) \right] \tag{17.80}$$


In the case of Gaussian observation models, this can result in a significant (constant factor)
speedup, since computing $\log p(x_t | z_t)$ can be much faster than computing $p(x_t | z_t)$ for a
high-dimensional Gaussian. This is one reason why the Viterbi algorithm is widely used in the E step
of EM (Section 17.5.2) when training large speech recognition systems based on HMMs.
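
The log-domain recursion and the traceback can be sketched as follows. This is an illustrative Python version, not code from the book: the function name, the data layout and the toy model are assumptions. The sketch works with unnormalized scores, so the returned value is the log joint probability of the best path and the observations.

```python
import math

def viterbi(pi, A, evidence):
    """Most probable state path, computed in the log domain (hypothetical helper).

    pi[j] = p(z_1 = j), A[i][j] = p(z_t = j | z_{t-1} = i),
    evidence[t][j] = p(x_t | z_t = j).
    Returns (path, log_score), where log_score = log p(path, x_{1:T}).
    """
    K, T = len(pi), len(evidence)

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # Initialization: delta_1(j) = pi_j * phi_1(j), in logs.
    delta = [log(pi[j]) + log(evidence[0][j]) for j in range(K)]
    backpointers = []
    for t in range(1, T):
        new_delta, ptr = [], []
        for j in range(K):
            # Most likely previous state on the best path into state j.
            best_i = max(range(K), key=lambda i: delta[i] + log(A[i][j]))
            ptr.append(best_i)
            new_delta.append(delta[best_i] + log(A[best_i][j]) + log(evidence[t][j]))
        delta = new_delta
        backpointers.append(ptr)

    # Termination and traceback.
    path = [max(range(K), key=lambda j: delta[j])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, max(delta)

# Toy 2-state model; all numbers are made up for illustration.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
evidence = [[0.7, 0.1], [0.6, 0.3], [0.1, 0.8]]
path, log_score = viterbi(pi, A, evidence)
```

Because the backpointers are stored for every time step and state, the sketch uses O(KT) memory, in line with the space complexity discussed below.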


Example 


Figure 17.13 gives a worked example of the Viterbi algorithm, based on (Russell et al. 1995).
Suppose we observe the discrete sequence of observations $x_{1:4} = (C_1, C_3, C_4, C_6)$, representing
codebook entries in a vector-quantized version of a speech signal. The model starts in state
$z_1$. The probability of generating $C_1$ in $z_1$ is 0.5, so we have $\delta_1(1) = 0.5$, and $\delta_1(i) = 0$ for
all other states. Next we can self-transition to $z_1$ with probability 0.3, or transition to $z_2$ with
probability 0.7. If we end up in $z_1$, the probability of generating $C_3$ is 0.3; if we end up in $z_2$,
the probability of generating $C_3$ is 0.2. Hence we have


$$\delta_2(1) = \delta_1(1) \, \Psi(1, 1) \, \phi_2(1) = 0.5 \cdot 0.3 \cdot 0.3 = 0.045 \tag{17.81}$$
$$\delta_2(2) = \delta_1(1) \, \Psi(1, 2) \, \phi_2(2) = 0.5 \cdot 0.7 \cdot 0.2 = 0.07 \tag{17.82}$$


Thus state 2 is more probable at t = 2; see the second column of Figure 17.13(b). In time step
3, we see that there are two paths into $z_2$, from $z_1$ and from $z_2$. The bold arrow indicates that
the latter is more probable. Hence this is the only one we have to remember. The algorithm
continues in this way until we have reached the end of the sequence. Once we have reached the
end, we can follow the black arrows back to recover the MAP path (which is 1,2,2,3).


Time and space complexity 


The time complexity of Viterbi is clearly $O(K^2 T)$ in general, and the space complexity
is O(KT), both the same as forwards-backwards. If the transition matrix has the form
$\Psi(i, j) \propto \exp(-\sigma^2 ||z_i - z_j||^2)$, where $z_i$ is the continuous vector represented by state i, we
can implement Viterbi in O(TK) time, instead of the O(TK log K) needed by forwards-backwards.
See Section 22.2.6.1 for details.


N-best list 


The Viterbi algorithm returns one of the most probable paths. It can be extended to return the
top N paths (Schwarz and Chow 1990; Nilsson and Goldberger 2001). This is called the N-best
list. One can then use a discriminative method to rerank the paths based on global features
derived from the fully observed state sequence (as well as the visible features). This technique
is widely used in speech recognition. For example, consider the sentence “recognize speech”. It
is possible that the most probable interpretation by the system of this acoustic signal is “wreck
a nice speech”, or maybe “wreck a nice beach”. Maybe the correct interpretation is much lower
down on the list. However, by using a re-ranking system, we may be able to improve the score
of the correct interpretation based on a more global context.

One problem with the N-best list is that often the top N paths are very similar to each other,
rather than representing qualitatively different interpretations of the data. Instead we might want
to generate a more diverse set of paths to more accurately represent posterior uncertainty. One
way to do this is to sample paths from the posterior, as we discuss below. For some other ways
to generate diverse MAP estimates, see e.g., (Yadollahpour et al. 2011; Kulesza and Taskar 2011).


Forwards filtering, backwards sampling


It is often useful to sample paths from the posterior:


$$z^s_{1:T} \sim p(z_{1:T} | x_{1:T}) \tag{17.83}$$


We can do this as follows: run forwards-backwards, to compute the two-slice smoothed
posteriors, $p(z_{t-1}, z_t | x_{1:T})$; next compute the conditionals $p(z_t | z_{t-1}, x_{1:T})$ by normalizing; sample from
the initial pair of states, $z^*_{1,2} \sim p(z_{1,2} | x_{1:T})$; finally, recursively sample $z^*_t \sim p(z_t | z^*_{t-1}, x_{1:T})$.
Note that the above solution requires a forwards-backwards pass, and then an additional
forwards sampling pass. An alternative is to do the forwards pass, and then perform sampling
in the backwards pass. The key insight into how to do this is that we can write the joint from
right to left using


$$p(z_{1:T} | x_{1:T}) = p(z_T | x_{1:T}) \prod_{t=T-1}^{1} p(z_t | z_{t+1}, x_{1:T}) \tag{17.84}$$


We can then sample $z_t$ given future sampled states using

$$z^s_t \sim p(z_t | z_{t+1:T}, x_{1:T}) = p(z_t | z_{t+1}, z_{t+2:T}, x_{1:t}, x_{t+1:T}) = p(z_t | z_{t+1}, x_{1:t}) \tag{17.85}$$
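The sampling distribution is given by


$$p(z_t = i | z_{t+1} = j, x_{1:t}) = p(z_t | z_{t+1}, x_{1:t}, x_{t+1}) \tag{17.86}$$
$$= \frac{p(z_{t+1}, z_t | x_{1:t+1})}{p(z_{t+1} | x_{1:t+1})} \tag{17.87}$$
$$\propto \frac{p(x_{t+1} | z_{t+1}, z_t, x_{1:t}) \, p(z_{t+1}, z_t | x_{1:t})}{p(z_{t+1} | x_{1:t+1})} \tag{17.88}$$
$$= \frac{p(x_{t+1} | z_{t+1}) \, p(z_{t+1} | z_t, x_{1:t}) \, p(z_t | x_{1:t})}{p(z_{t+1} | x_{1:t+1})} \tag{17.89}$$
$$= \frac{\phi_{t+1}(j) \, \Psi(i, j) \, \alpha_t(i)}{\alpha_{t+1}(j)} \tag{17.90}$$

where $\phi_{t+1}(j) = p(x_{t+1} | z_{t+1} = j)$ is the local evidence. The base case is

$$z^s_T \sim p(z_T = i | x_{1:T}) = \alpha_T(i) \tag{17.91}$$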

This algorithm forms the basis of blocked-Gibbs sampling methods for parameter inference, 
as we will see below. 
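
The procedure can be sketched in Python as follows. The function name, data layout and toy model are assumptions made for illustration. Since $\phi_{t+1}(j)$ and $\alpha_{t+1}(j)$ in Equation 17.90 do not depend on i, the sketch samples $z_t$ with weights proportional to $\Psi(i, j) \alpha_t(i)$ only.

```python
import random

def ffbs_sample(pi, A, evidence, rng):
    """Draw one path z_{1:T} ~ p(z_{1:T} | x_{1:T}) (hypothetical helper).

    pi[j] = p(z_1 = j), A[i][j] = p(z_t = j | z_{t-1} = i),
    evidence[t][j] = p(x_t | z_t = j); rng is a random.Random instance.
    """
    K, T = len(pi), len(evidence)

    def normalize(u):
        Z = sum(u)
        return [x / Z for x in u]

    # Forwards filtering.
    alpha = [normalize([evidence[0][j] * pi[j] for j in range(K)])]
    for t in range(1, T):
        pred = [sum(A[i][j] * alpha[-1][i] for i in range(K)) for j in range(K)]
        alpha.append(normalize([evidence[t][j] * pred[j] for j in range(K)]))

    # Backwards sampling: z_T ~ alpha_T, then z_t | z_{t+1} = j.
    z = [0] * T
    z[T - 1] = rng.choices(range(K), weights=alpha[T - 1])[0]
    for t in range(T - 2, -1, -1):
        j = z[t + 1]
        weights = [A[i][j] * alpha[t][i] for i in range(K)]
        z[t] = rng.choices(range(K), weights=weights)[0]
    return z

# Toy 2-state model; all numbers are made up for illustration.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
evidence = [[0.7, 0.1], [0.6, 0.3], [0.1, 0.8]]
sample_path = ffbs_sample(pi, A, evidence, random.Random(0))
```

Repeated calls with the same `rng` give independent posterior samples, and only the forwards messages need to be stored.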


Learning for HMMs 


We now discuss how to estimate the parameters $\theta = (\pi, A, B)$, where $\pi(i) = p(z_1 = i)$ is
the initial state distribution, $A(i, j) = p(z_t = j | z_{t-1} = i)$ is the transition matrix, and B are
the parameters of the class-conditional densities $p(x_t | z_t = j)$. We first consider the case where
$z_{1:T}$ is observed in the training set, and then the harder case where $z_{1:T}$ is hidden.


Training with fully observed data 


If we observe the hidden state sequences, we can compute the MLEs for A and $\pi$ exactly as in
Section 17.2.2.1. If we use a conjugate prior, we can also easily compute the posterior.

The details on how to estimate B depend on the form of the observation model. The
situation is identical to fitting a generative classifier. For example, if each state has a multinoulli
distribution associated with it, with parameters $B_{jl} = p(x_t = l | z_t = j)$, where $l \in \{1, \ldots, L\}$
represents the observed symbol, the MLE is given by


$$\hat{B}_{jl} = \frac{N^X_{jl}}{N_j}, \quad N^X_{jl} \triangleq \sum_{i=1}^{N} \sum_{t=1}^{T_i} I(z_{i,t} = j, x_{i,t} = l) \tag{17.92}$$

This result is quite intuitive: we simply add up the number of times we are in state j and we
see a symbol l, and divide by the number of times we are in state j.
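
As a small illustration of this counting estimate, the following Python sketch computes $\hat{B}$ from fully labeled sequences. The function name and the toy sequences are hypothetical, and states and symbols are indexed from 0 rather than 1.

```python
from collections import Counter

def mle_emissions(state_seqs, obs_seqs, K, L):
    """MLE of B[j][l] = p(x_t = l | z_t = j) from fully observed sequences."""
    counts = Counter()
    for zs, xs in zip(state_seqs, obs_seqs):
        counts.update(zip(zs, xs))  # count (state, symbol) co-occurrences
    B = []
    for j in range(K):
        N_j = sum(counts[(j, l)] for l in range(L))  # times we are in state j
        # States that never occur get an all-zero row in this sketch.
        B.append([counts[(j, l)] / N_j if N_j else 0.0 for l in range(L)])
    return B

# Two toy training sequences (made up for illustration).
state_seqs = [[0, 0, 1], [1, 1, 0]]
obs_seqs = [[0, 1, 1], [1, 0, 0]]
B_hat = mle_emissions(state_seqs, obs_seqs, K=2, L=2)
```

In practice one would usually add pseudo-counts (a MAP estimate) so that unseen state/symbol pairs do not receive zero probability.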

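Similarly, if each state has a Gaussian distribution associated with it, we have (from
Section 4.2.4) the following MLEs:


$$\hat{\mu}_k = \frac{\bar{x}_k}{N_k}, \quad \hat{\Sigma}_k = \frac{(\overline{xx})_k^T - N_k \hat{\mu}_k \hat{\mu}_k^T}{N_k} \tag{17.93}$$


where the sufficient statistics are given by


$$\bar{x}_k \triangleq \sum_{i=1}^{N} \sum_{t=1}^{T_i} I(z_{i,t} = k) \, x_{i,t} \tag{17.94}$$
$$(\overline{xx})_k^T \triangleq \sum_{i=1}^{N} \sum_{t=1}^{T_i} I(z_{i,t} = k) \, x_{i,t} x_{i,t}^T \tag{17.95}$$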

Analogous results can be derived for other kinds of distributions. One can also easily extend all 
of these results to compute MAP estimates, or even full posteriors over the parameters. 


EM for HMMs (the Baum-Welch algorithm) 


If the $z_t$ variables are not observed, we are in a situation analogous to fitting a mixture model.
The most common approach is to use the EM algorithm to find the MLE or MAP parameters,
although of course one could instead use gradient-based methods (see e.g., (Baldi and Chauvin
1994)). In this section, we derive the EM algorithm. When applied to HMMs, this is also known
as the Baum-Welch algorithm (Baum et al. 1970).


E step 


It is straightforward to show that the expected complete data log likelihood is given by


$$Q(\theta, \theta^{old}) = \sum_{k=1}^{K} E[N_k^1] \log \pi_k + \sum_{j=1}^{K} \sum_{k=1}^{K} E[N_{jk}] \log A_{jk} \tag{17.96}$$
$$+ \sum_{i=1}^{N} \sum_{t=1}^{T_i} \sum_{k=1}^{K} p(z_t = k | x_i, \theta^{old}) \log p(x_{i,t} | \phi_k) \tag{17.97}$$

where the expected counts are given by

$$E[N_k^1] = \sum_{i=1}^{N} p(z_{i1} = k | x_i, \theta^{old}) \tag{17.98}$$
$$E[N_{jk}] = \sum_{i=1}^{N} \sum_{t=2}^{T_i} p(z_{i,t-1} = j, z_{i,t} = k | x_i, \theta^{old}) \tag{17.99}$$
$$E[N_j] = \sum_{i=1}^{N} \sum_{t=1}^{T_i} p(z_{i,t} = j | x_i, \theta^{old}) \tag{17.100}$$

These expected sufficient statistics can be computed by running the forwards-backwards
algorithm on each sequence. In particular, this algorithm computes the following smoothed node
and edge marginals:


$$\gamma_{i,t}(j) \triangleq p(z_t = j | x_{i,1:T_i}, \theta) \tag{17.101}$$
$$\xi_{i,t}(j, k) \triangleq p(z_{t-1} = j, z_t = k | x_{i,1:T_i}, \theta) \tag{17.102}$$


M step


Based on Section 11.3, we have that the M step for A and $\pi$ is to just normalize the expected
counts:


$$\hat{A}_{jk} = \frac{E[N_{jk}]}{\sum_{k'} E[N_{jk'}]}, \quad \hat{\pi}_k = \frac{E[N_k^1]}{N} \tag{17.103}$$

This result is quite intuitive: we simply add up the expected number of transitions from j to k,
and divide by the expected number of times we transition from j to anything else.
For a multinoulli observation model, the expected sufficient statistics are


$$E[M_{jl}] = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j) \, I(x_{i,t} = l) = \sum_{i=1}^{N} \sum_{t : x_{i,t} = l} \gamma_{i,t}(j) \tag{17.104}$$


The M step has the form

$$\hat{B}_{jl} = \frac{E[M_{jl}]}{E[N_j]} \tag{17.105}$$


This result is quite intuitive: we simply add up the expected number of times we are in state j 
and we see a symbol l, and divide by the expected number of times we are in state j. 
For a Gaussian observation model, the expected sufficient statistics are given by 


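$$E[\bar{x}_k] = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(k) \, x_{i,t} \tag{17.106}$$
$$E[(\overline{xx})_k^T] = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(k) \, x_{i,t} x_{i,t}^T \tag{17.107}$$


The M step becomes


$$\hat{\mu}_k = \frac{E[\bar{x}_k]}{E[N_k]}, \quad \hat{\Sigma}_k = \frac{E[(\overline{xx})_k^T] - E[N_k] \hat{\mu}_k \hat{\mu}_k^T}{E[N_k]} \tag{17.108}$$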


This can (and should) be regularized in the same way we regularize GMMs. 
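
To make the M step for A and $\pi$ concrete, here is a minimal Python sketch that normalizes the expected counts, as in Equation 17.103. It assumes the smoothed node and edge marginals have already been computed by forwards-backwards; the function name and nested-list layout are hypothetical, and no regularization is applied. The toy inputs are one-hot posteriors, so the result coincides with the fully observed MLE.

```python
def m_step(gammas, xis):
    """M step for the initial distribution and the transition matrix.

    gammas[n][t][j]  : smoothed node marginal for sequence n, time t
    xis[n][t][j][k]  : smoothed edge marginal for the pair (t, t+1)
    Assumes every state has a nonzero expected number of outgoing transitions.
    """
    N, K = len(gammas), len(gammas[0][0])
    # pi_k = E[N_k^1] / N: average of the marginals at the first time step.
    pi = [sum(g[0][k] for g in gammas) / N for k in range(K)]
    # E[N_jk]: sum the edge marginals over all sequences and time steps.
    counts = [[0.0] * K for _ in range(K)]
    for xi in xis:
        for xi_t in xi:
            for j in range(K):
                for k in range(K):
                    counts[j][k] += xi_t[j][k]
    A = [[c / sum(row) for c in row] for row in counts]
    return pi, A

# One-hot "posteriors" for the state paths (0, 0, 1) and (1, 0, 1).
gammas = [
    [[1, 0], [1, 0], [0, 1]],
    [[0, 1], [1, 0], [0, 1]],
]
xis = [
    [[[1, 0], [0, 0]], [[0, 1], [0, 0]]],
    [[[0, 0], [1, 0]], [[0, 1], [0, 0]]],
]
pi_hat, A_hat = m_step(gammas, xis)
```

Adding pseudo-counts to `counts` before normalizing would give a MAP estimate and avoids division by zero for states that are never visited.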


Initialization 


As usual with EM, we must take care to ensure that we initialize the parameters carefully, to 
minimize the chance of getting stuck in poor local optima. There are several ways to do this, 
such as 
• Use some fully labeled data to initialize the parameters.

• Initially ignore the Markov dependencies, and estimate the observation parameters using the
standard mixture model estimation methods, such as K-means or EM.

• Randomly initialize the parameters, use multiple restarts, and pick the best solution.


Techniques such as deterministic annealing (Ueda and Nakano 1998; Rao and Rose 2001) 
can help mitigate the effect of local minima. Also, just as K-means is often used to initialize 
EM for GMMs, so it is common to initialize EM for HMMs using Viterbi training, which 
means approximating the posterior over paths with the single most probable path. (This is not 
necessarily a good idea, since initially the parameters are often poorly estimated, so the Viterbi 
path will be fairly arbitrary. A safer option is to start training using forwards-backwards, and to 
switch to Viterbi near convergence.) 


Bayesian methods for “fitting” HMMs * 


EM returns only a point estimate (MLE or MAP) of the parameters. In this section, we briefly discuss some methods
for Bayesian parameter estimation in HMMs. (These methods rely on material that we will cover 
later in the book.) 

One approach is to use variational Bayes EM (VBEM), which we discuss in general terms in 
Section 21.6. The details for the HMM case can be found in (MacKay 1997; Beal 2003), but 
the basic idea is this: The E step uses forwards-backwards, but where (roughly speaking) we 
plug in the posterior mean parameters instead of the MAP estimates. The M step updates the 
parameters of the conjugate posteriors, instead of updating the parameters themselves. 

An alternative to VBEM is to use MCMC. A particularly appealing algorithm is block Gibbs
sampling, which we discuss in general terms in Section 24.2.8. The details for the HMM case
can be found in (Fruhwirth-Schnatter 2007), but the basic idea is this: we sample $z_{1:T}$ given
the data and parameters using forwards-filtering, backwards-sampling, and we then sample the
parameters from their posteriors, conditional on the sampled latent paths. This is simple to
implement, but one does need to take care of unidentifiability (label switching), just as with
mixture models (see Section 11.3.1).


Discriminative training 


Sometimes HMMs are used as the class conditional density inside a generative classifier. In this
case, $p(x | y = c, \theta)$ can be computed using the forwards algorithm. We can easily maximize the
joint likelihood $\prod_{i=1}^{N} p(x_i, y_i | \theta)$ by using EM (or some other method) to fit the HMM for each
class-conditional density separately.

However, we might like to find the parameters that maximize the conditional likelihood 


$$\prod_{i=1}^{N} p(y_i | x_i, \theta) = \prod_{i=1}^{N} \frac{p(y_i | \theta) \, p(x_i | y_i, \theta)}{\sum_{c=1}^{C} p(y_i = c | \theta) \, p(x_i | c, \theta)} \tag{17.109}$$


This is more expensive than maximizing the joint likelihood, since the denominator couples all C 
class-conditional HMMs together. Furthermore, EM can no longer be used, and one must resort 
to generic gradient-based methods. Nevertheless, discriminative training can result in improved
accuracies. The standard practice in speech recognition is to initially train the generative models
separately using EM, and then to fine-tune them discriminatively (Jelinek 1997).
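
For a single sequence, the class posterior inside Equation 17.109 can be evaluated stably from per-class log likelihoods. The Python sketch below assumes that each entry of `log_lik` has already been obtained by running the forwards algorithm with the HMM for that class; the function name and the example numbers are hypothetical.

```python
import math

def class_posterior(log_prior, log_lik):
    """p(y = c | x) from log p(y = c) and log p(x | y = c), via log-sum-exp."""
    scores = [lp + ll for lp, ll in zip(log_prior, log_lik)]
    m = max(scores)
    # Subtracting the maximum keeps the exponentials from underflowing.
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_norm) for s in scores]

# Sequence log likelihoods are typically very negative for long sequences.
log_prior = [math.log(0.5), math.log(0.5)]
log_lik = [-1000.0, -1001.0]
posterior = class_posterior(log_prior, log_lik)
```

The denominator computed here is what couples the class-conditional HMMs together during discriminative training.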


Model selection 


In HMMs, the two main model selection issues are: how many states, and what topology to use 
for the state transition diagram. We discuss both of these issues below. 


Choosing the number of hidden states 


Choosing the number of hidden states K in an HMM is analogous to the problem of choosing 
the number of mixture components. Here are some possible solutions: 


• Use grid-search over a range of K’s, using as an objective function cross-validated likelihood,
the BIC score, or a variational lower bound to the log-marginal likelihood.

• Use reversible jump MCMC. See (Fruhwirth-Schnatter 2007) for details. Note that this is very
slow and is not widely used.

• Use variational Bayes to “extinguish” unwanted components, by analogy to the GMM case
discussed in Section 21.6.1.6. See (MacKay 1997; Beal 2003) for details.

• Use an “infinite HMM”, which is based on the hierarchical Dirichlet process. See e.g., (Beal
et al. 2002; Teh et al. 2006) for details.


Structure learning 


The term structure learning in the context of HMMs refers to learning a sparse transition 
matrix. That is, we want to learn the structure of the state transition diagram, not the structure 
of the graphical model (which is fixed). A large number of heuristic methods have been proposed. 
Most alternate between parameter estimation and some kind of heuristic split merge method 
(see e.g., (Stolcke and Omohundro 1992)). 

Alternatively, one can pose the problem as MAP estimation using a minimum entropy prior, 
of the form 


$$p(A_{i,:}) \propto \exp(-H(A_{i,:})) \tag{17.110}$$


This prior prefers states whose outgoing distribution is nearly deterministic, and hence has low 
entropy (Brand 1999). The corresponding M step cannot be solved in closed form, but numerical 
methods can be used. The trouble with this is that we might prune out all incoming transitions 
to a state, creating isolated “islands” in state-space. The infinite HMM presents an interesting 
alternative to these methods. See e.g., (Beal et al. 2002; Teh et al. 2006) for details. 


Generalizations of HMMs 


Many variants of the basic HMM model have been proposed. We briefly discuss some of them 
below. 
Figure 17.14 Encoding a hidden semi-Markov model as a DGM. $D_t$ are deterministic duration counters.


Variable duration (semi-Markov) HMMs 


In a standard HMM, the probability we remain in state i for exactly d steps is

$$p(t_i = d) = (1 - A_{ii}) A_{ii}^d \propto \exp(d \log A_{ii}) \tag{17.111}$$


where $A_{ii}$ is the self-loop probability. This is called the geometric distribution. However, this
kind of exponentially decaying function of d is sometimes unrealistic.

To allow for more general durations, one can use a semi-Markov model. It is called semi- 
Markov because to predict the next state, it is not sufficient to condition on the past state: we 
also need to know how long we've been in that state. When the state space is not observed 
directly, the result is called a hidden semi-Markov model (HSMM), a variable duration HMM, 
or an explicit duration HMM. 

HSMMs are widely used in many gene finding programs, since the length distribution of
exons and introns is not geometric (see e.g., (Schweikerta et al. 2009)), and in some ChIP-Seq
data analysis programs (see e.g., (Kuan et al. 2009)).

HSMMs are useful not only because they can model the waiting time of each state more
accurately, but also because they can model the distribution of a whole batch of observations at
once, instead of assuming all observations are conditionally iid. That is, they can use likelihood
models of the form $p(x_{t:t+l} | z_t = k, d_t = l)$, which generate l correlated observations if the
duration in state k is for l time steps. This is useful for modeling data that is piecewise linear,
or shows other local trends (Ostendorf et al. 1996).


HSMM as augmented HMMs 


One way to represent a HSMM is to use the graphical model shown in Figure 17.14. (In this
figure, we have assumed the observations are iid within each state, but this is not required,
as mentioned above.) The $D_t \in \{0, 1, \ldots, D\}$ node is a state duration counter, where D is
the maximum duration of any state. When we first enter state j, we sample $D_t$ from the
duration distribution for that state, $D_t \sim p_j(\cdot)$. Thereafter, $D_t$ deterministically counts down
until $D_t = 0$. While $D_t > 0$, the state $z_t$ is not allowed to change. When $D_t = 0$, we make a
stochastic transition to a new state.


Figure 17.15 (a) A Markov chain with n = 4 repeated states and self loops. (b) The resulting distribution
over sequence lengths, for p = 0.99 and various n. Figure generated by hmmSelfLoopDist.


More precisely, we define the CPDs as follows:


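$$p(D_t = d' | D_{t-1} = d, z_t = j) = \begin{cases} p_j(d') & \text{if } d = 0 \\ 1 & \text{if } d' = d - 1 \text{ and } d \geq 1 \\ 0 & \text{otherwise} \end{cases} \tag{17.112}$$

$$p(z_t = k | z_{t-1} = j, D_{t-1} = d) = \begin{cases} 1 & \text{if } d > 0 \text{ and } j = k \\ A_{jk} & \text{if } d = 0 \\ 0 & \text{otherwise} \end{cases} \tag{17.113}$$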


Note that $p_j(d)$ could be represented as a table (a non-parametric approach) or as some kind
of parametric distribution, such as a Gamma distribution. If $p_j(d)$ is a geometric distribution,
this emulates a standard HMM.

One can perform inference in this model by defining a mega-variable $Y_t = (D_t, z_t)$. However,
this is rather inefficient, since $D_t$ is deterministic. It is possible to marginalize $D_t$ out, and derive
special purpose inference procedures. See (Guedon 2003; Yu and Kobayashi 2006) for details.
Unfortunately, all these methods take $O(T K^2 D)$ time, where T is the sequence length, K is
the number of states, and D is the maximum duration of any state.
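
The generative process defined by the two CPDs above can be sketched in Python as follows. This only samples a state sequence; it does not perform inference. The function name, the tabular representation of the duration distributions and the toy model are assumptions made for illustration.

```python
import random

def sample_hsmm_states(T, pi, A, duration_pmf, rng):
    """Sample z_{1:T} from an HSMM written with explicit duration counters.

    pi[j]              : initial state distribution
    A[j][k]            : transition matrix, used only when the counter hits 0
    duration_pmf[j][d] : p_j(d), the counter value drawn on entering state j
    rng                : a random.Random instance
    """
    K = len(pi)

    def draw_duration(j):
        pmf = duration_pmf[j]
        return rng.choices(range(len(pmf)), weights=pmf)[0]

    z = rng.choices(range(K), weights=pi)[0]
    d = draw_duration(z)
    states = [z]
    for _ in range(T - 1):
        if d > 0:
            d -= 1  # deterministic countdown; the state is not allowed to change
        else:
            z = rng.choices(range(K), weights=A[z])[0]  # stochastic transition
            d = draw_duration(z)  # reset the counter for the new state
        states.append(z)
    return states

# Toy model with deterministic durations: state 0 lasts 3 steps, state 1 lasts 2.
pi = [1.0, 0.0]
A = [[0.0, 1.0], [1.0, 0.0]]
duration_pmf = [[0.0, 0.0, 1.0], [0.0, 1.0]]
states = sample_hsmm_states(10, pi, A, duration_pmf, random.Random(0))
```

With these deterministic duration tables the sampled sequence simply alternates between a block of three 0s and a block of two 1s.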


Approximations to semi-Markov models 


A more efficient, but less flexible, way to model non-geometric waiting times is to replace each
state with n new states, each with the same emission probabilities as the original state. For
example, consider the model in Figure 17.15(a). Obviously the smallest sequence this can generate
is of length n = 4. Any path of length d through the model has probability $p^{d-n}(1 - p)^n$;
multiplying by the number of possible paths we find that the total probability of a path of length
d is

$$p(d) = \binom{d - 1}{n - 1} p^{d-n} (1 - p)^n \tag{17.114}$$
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Figure 17.16 An example of an HHMM for an ASR system which can recognize 3 words. The top level 
represents bigram word probabilities. The middle level represents the phonetic spelling of each word. The 
bottom level represents the subphones of each phone. (It is traditional to represent a phone as a 3 state 
HMM, representing the beginning, middle and end.) Based on Figure 7.5 of Jurafsky and Martin 2000). 
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This is equivalent to the negative binomial distribution. By adjusting n and the self-loop 
probabilities p of each state, we can model a wide range of waiting times: see Figure 17.15(b). 
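
As a rough numerical check of Equation 17.114, the following sketch evaluates the duration distribution of the expanded-state chain. The function name and the settings n = 4, p = 0.5 are illustrative assumptions, not values taken from the text.

```python
from math import comb

def expanded_state_duration_pmf(d, n, p):
    """Probability of spending exactly d steps in a chain of n copies of a state,
    each with self-loop probability p (Equation 17.114)."""
    if d < n:
        return 0.0  # each of the n copies must be visited at least once
    # comb(d-1, n-1) counts the ways to distribute the d-n self-loops over n copies
    return comb(d - 1, n - 1) * p ** (d - n) * (1 - p) ** n

# Hypothetical settings: n = 4 copies, self-loop probability p = 0.5.
durations = range(1, 400)
pmf = [expanded_state_duration_pmf(d, n=4, p=0.5) for d in durations]
total = sum(pmf)                                   # should be very close to 1
mean = sum(d * q for d, q in zip(durations, pmf))  # should be close to n / (1 - p)
print(total, mean)
```

Under these assumptions the mean waiting time is n/(1 − p), so increasing n or p both lengthen the expected duration, while n also controls the shape of the distribution.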

Let E be the number of expansions of each state needed to approximate $p_j(d)$. Forwards-backwards
on this model takes $O(T (KE) F_{in})$ time, where $F_{in}$ is the average number of
predecessor states, compared to $O(T K (F_{in} + D))$ for the HSMM. For typical speech recognition
applications, $F_{in} \sim 3$, $D \sim 50$, $K \sim 10^6$, $T \sim 10^5$. (Similar figures apply to problems such
as gene finding, which also often uses HSMMs.) Since $F_{in} + D \gg E F_{in}$, the expanded state
method is much faster than an HSMM. See (Johnson 2005) for details.


Hierarchical HMMs 


A hierarchical HMM (HHMM) (Fine et al. 1998) is an extension of the HMM that is designed to 
model domains with hierarchical structure. Figure 17.16 gives an example of an HHMM used in 
automatic speech recognition. The phone and subphone models can be “called” from different 
higher level contexts. We can always “flatten” an HHMM to a regular HMM, but a factored 
representation is often easier to interpret, and allows for more efficient inference and model 
fitting. 

HHMMs have been used in many application domains, e.g., speech recognition (Bilmes 2001), 
gene finding (Hu et al. 2000), plan recognition (Bui et al. 2002), monitoring transportation 
patterns (Liao et al. 2007), indoor robot localization (Theocharous et al. 2004), etc. HHMMs are 
less expressive than stochastic context free grammars (SCFGs), since they only allow hierarchies 
of bounded depth, but they support more efficient inference. In particular, inference in SCFGs
(using the inside-outside algorithm (Jurafsky and Martin 2008)) takes $O(T^3)$ time, whereas inference
in an HHMM takes $O(T)$ time (Murphy and Paskin 2001).

We can represent an HHMM as a directed graphical model as shown in Figure 17.17. $Q_t^\ell$
represents the state at time t and level $\ell$. A state transition at level $\ell$ is only “allowed” if the
chain at the level below has “finished”, as determined by the $F_t^{\ell+1}$ node. (The chain below
finishes when it chooses to enter its end state.) This mechanism ensures that higher level chains
evolve more slowly than lower level chains, i.e., lower levels are nested within higher levels.

Figure 17.17 An HHMM represented as a DGM. $Q_t^\ell$ is the state at time t, level $\ell$; $F_t^\ell = 1$ if the HMM at
level $\ell$ has finished (entered its exit state), otherwise $F_t^\ell = 0$. Shaded nodes are observed; the remaining
nodes are hidden. We may optionally clamp $F_T^\ell = 1$, where T is the length of the observation sequence,
to ensure all models have finished by the end of the sequence. Source: Figure 2 of (Murphy and Paskin
2001).

A variable duration HMM can be thought of as a special case of an HHMM, where the top 
level is a deterministic counter, and the bottom level is a regular HMM, which can only change 
states once the counter has “timed out”. See (Murphy and Paskin 2001) for further details. 


Input-output HMMs 
It is straightforward to extend an HMM to handle inputs, as shown in Figure 17.18(a). This defines
a conditional density model for sequences of the form

$$p(y_{1:T}, z_{1:T} \mid u_{1:T}, \theta) \qquad (17.115)$$

where $u_t$ is the input at time t; this is sometimes called a control signal. If the inputs and
outputs are continuous, a typical parameterization would be

$$p(z_t \mid u_t, z_{t-1} = i, \theta) = \mathrm{Cat}(z_t \mid \mathcal{S}(W_i u_t)) \qquad (17.116)$$
$$p(y_t \mid u_t, z_t = j, \theta) = \mathcal{N}(y_t \mid V_j u_t, \Sigma_j) \qquad (17.117)$$

Thus the transition matrix is a logistic regression model whose parameters depend on the
previous state. The observation model is a Gaussian whose parameters depend on the current
state. The whole model can be thought of as a hidden version of a maximum entropy Markov
model (Section 19.6.1).

Figure 17.18 (a) Input-output HMM. (b) First-order auto-regressive HMM. (c) A second-order buried Markov
model. Depending on the value of the hidden variables, the effective graph structure between the components
of the observed variables (i.e., the non-zero elements of the regression matrix and the precision
matrix) can change, although this is not shown.

Conditional on the inputs $u_{1:T}$ and the parameters $\theta$, one can apply the standard forwards-backwards
algorithm to estimate the hidden states. It is also straightforward to derive an EM
algorithm to estimate the parameters (see (Bengio and Frasconi 1996) for details).


Auto-regressive and buried HMMs 


The standard HMM assumes the observations are conditionally independent given the hidden 
state. In practice this is often not the case. However, it is straightforward to have direct arcs from
$x_{t-1}$ to $x_t$ as well as from $z_t$ to $x_t$, as in Figure 17.18(b). This is known as an auto-regressive
HMM, or a regime switching Markov model. For continuous data, the observation model
becomes

$$p(x_t \mid x_{t-1}, z_t = j, \theta) = \mathcal{N}(x_t \mid W_j x_{t-1} + \mu_j, \Sigma_j) \qquad (17.118)$$

This is a linear regression model, where the parameters are chosen according to the current
hidden state. We can also consider higher-order extensions, where we condition on the last L
observations:

$$p(x_t \mid x_{t-L:t-1}, z_t = j, \theta) = \mathcal{N}\Big(x_t \,\Big|\, \sum_{\ell=1}^{L} W_{j,\ell}\, x_{t-\ell} + \mu_j, \Sigma_j\Big) \qquad (17.119)$$


Such models are widely used in econometrics (Hamilton 1990). Similar models can be defined 
for discrete observations. 

Figure 17.19 (a) A factorial HMM with 3 chains. (b) A coupled HMM with 3 chains.

The AR-HMM essentially combines two Markov chains, one on the hidden variables, to capture
long range dependencies, and one on the observed variables, to capture short range dependencies
(Berchtold 1999). Since the x nodes are observed, the connections between them only
change the computation of the local evidence; inference can still be performed using the standard
forwards-backwards algorithm. Parameter estimation using EM is also straightforward: the
E step is unchanged, as is the M step for the transition matrix. If we assume scalar observations
for notational simplicity, the M step involves minimizing

$$\sum_t E\left[ \frac{1}{\sigma^2(z_t)} \big(x_t - x_{t-L:t-1}^T w(z_t)\big)^2 + \log \sigma^2(z_t) \right] \qquad (17.120)$$

Focussing on the w terms, we see that this requires solving K weighted least squares problems:

$$J(w_{1:K}) = \sum_j \sum_t \gamma_t(j) \big(x_t - x_{t-L:t-1}^T w_j\big)^2 \qquad (17.121)$$

where $\gamma_t(j) = p(z_t = j \mid x_{1:T})$ is the smoothed posterior marginal. This is a weighted linear
regression problem, where the design matrix has a Toeplitz form. This subproblem can be solved
efficiently using the Levinson-Durbin method (Durbin and Koopman 2001).
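
To make the weighted least squares step concrete, here is a minimal sketch for the simplest case of scalar observations with L = 1, where each problem in Equation 17.121 has a closed-form solution. The function name, the synthetic data, and the hard (0/1) responsibilities are illustrative assumptions; this is not the Levinson-Durbin method mentioned above.

```python
def ar1_weighted_least_squares(x, gamma_j):
    """Minimize sum_t gamma_j[t] * (x[t] - w * x[t-1])**2 over the scalar w.

    Setting the derivative to zero gives
    w = sum_t gamma_j[t] x[t] x[t-1] / sum_t gamma_j[t] x[t-1]**2.
    """
    num = sum(gamma_j[t] * x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(gamma_j[t] * x[t - 1] ** 2 for t in range(1, len(x)))
    return num / den

# Synthetic, noise-free series with two regimes (hypothetical coefficients):
# steps 1..5 use w = 0.5, steps 6..10 use w = -0.8.
x = [1.0]
for t in range(1, 11):
    x.append((0.5 if t <= 5 else -0.8) * x[-1])

# Hard responsibilities stand in for the smoothed marginals gamma_t(j).
gamma_1 = [1.0 if t <= 5 else 0.0 for t in range(11)]
gamma_2 = [1.0 - g for g in gamma_1]

w_1 = ar1_weighted_least_squares(x, gamma_1)
w_2 = ar1_weighted_least_squares(x, gamma_2)
print(w_1, w_2)
```

In a real M step the weights would be the soft posteriors produced by forwards-backwards, and for L > 1 each problem becomes a small linear system rather than a single ratio.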

Buried Markov models generalize AR-HMMs by allowing the dependency structure between 
the observable nodes to change based on the hidden state, as in Figure 17.18(c). Such a model 
is called a dynamic Bayesian multi net, since it is a mixture of different networks. In the 
linear-Gaussian setting, we can change the structure of the $x_{t-1} \to x_t$ arcs by using sparse
regression matrices, $W_j$, and we can change the structure of the connections within the
components of $x_t$ by using sparse Gaussian graphical models, either directed or undirected. See
(Bilmes 2000) for details. 


Factorial HMM 


An HMM represents the hidden state using a single discrete random variable $z_t \in \{1, \ldots, K\}$.
To represent 10 bits of information would require $K = 2^{10} = 1024$ states. By contrast, consider
a distributed representation of the hidden state, where each $z_{c,t} \in \{0, 1\}$ represents the c'th
bit of the t'th hidden state. Now we can represent 10 bits using just 10 binary variables, as
illustrated in Figure 17.19(a). This model is called a factorial HMM (Ghahramani and Jordan
1997). The hope is that this kind of model could capture different aspects of a signal, e.g., one
chain would represent speaking style, another the words that are being spoken.

Unfortunately, conditioned on $x_t$, all the hidden variables are correlated (due to explaining
away the common observed child $x_t$). This makes exact state estimation intractable. However,
we can derive efficient approximate inference algorithms, as we discuss in Section 21.4.1.


Coupled HMM and the influence model 


If we have multiple related data streams, we can use a coupled HMM (Brand 1996), as illustrated 
in Figure 17.19(b). This is a series of HMMs where the state transitions depend on the states of 
neighboring chains. That is, we represent the joint conditional distribution as 


$$p(z_t \mid z_{t-1}) = \prod_c p(z_{ct} \mid z_{t-1}) \qquad (17.122)$$

$$p(z_{ct} \mid z_{t-1}) = p(z_{ct} \mid z_{c,t-1}, z_{c-1,t-1}, z_{c+1,t-1}) \qquad (17.123)$$


This has been used for various tasks, such as audio-visual speech recognition (Nefian et al. 
2002) and modeling freeway traffic flows (Kwon and Murphy 2000). 

The trouble with the above model is that it requires $O(C K^4)$ parameters to specify, if there
are C chains with K states per chain, because each state depends on its own past plus the
past of its two neighbors. There is a closely related model, known as the influence model
(Asavathiratham 2000), which uses fewer parameters. It models the joint conditional distribution
as

$$p(z_{ct} \mid z_{t-1}) = \sum_{c'=1}^{C} \alpha_{c,c'}\, p(z_{ct} \mid z_{c',t-1}) \qquad (17.124)$$

where $\sum_{c'} \alpha_{c,c'} = 1$ for each c. That is, we use a convex combination of pairwise transition
matrices. The $\alpha_{c,c'}$ parameter specifies how much influence chain c' has on chain c. This
model only takes $O(C^2 + C K^2)$ parameters to specify. Furthermore, it allows each chain to
be influenced by all the other chains, not just its nearest neighbors. (Hence the corresponding
graphical model is similar to Figure 17.19(b), except that each node has incoming edges from
all the previous nodes.) This has been used for various tasks, such as modeling conversational
interactions between people (Basu et al. 2001).

Unfortunately, inference in both of these models takes $O(T (K^C)^2)$ time, since all the chains
become fully correlated even if the interaction graph is sparse. Various approximate inference
methods can be applied, as we discuss later.
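
The scaling claims above can be made tangible with a few lines of arithmetic. The sketch below simply evaluates the order-of-magnitude expressions quoted in the text (constant factors ignored) for a hypothetical setting of C = 3 chains and K = 10 states per chain; the function names are illustrative.

```python
def coupled_hmm_param_order(C, K):
    """Order of the parameter count for the coupled HMM: C * K^4."""
    return C * K ** 4

def influence_model_param_order(C, K):
    """Order of the parameter count for the influence model: C^2 + C * K^2."""
    return C ** 2 + C * K ** 2

def exact_inference_cost_order(T, C, K):
    """Order of the cost of exact inference over the joint state space: T * (K^C)^2."""
    return T * (K ** C) ** 2

# Hypothetical setting: 3 chains, 10 states per chain, 100 time steps.
C, K, T = 3, 10, 100
coupled = coupled_hmm_param_order(C, K)
influence = influence_model_param_order(C, K)
cost = exact_inference_cost_order(T, C, K)
print(coupled, influence, cost)
```

Even in this small setting the influence model needs roughly a hundred times fewer parameters, while the exact inference cost is the same for both models because it is driven by the joint state space of size K^C.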


Dynamic Bayesian networks (DBNs) 


A dynamic Bayesian network is just a way to represent a stochastic process using a directed
graphical model.^7 Note that the network is not dynamic (the structure and parameters are fixed),
rather it is a network representation of a dynamical system. All of the HMM variants we have
seen above could be considered to be DBNs. However, we prefer to reserve the term “DBN”
for graph structures that are more “irregular” and problem-specific. An example is shown in
Figure 17.20, which is a DBN designed to monitor the state of a simulated autonomous car
known as the “Bayesian Automated Taxi”, or “BATmobile” (Forbes et al. 1995).

7. The acronym DBN can stand for either “dynamic Bayesian network” or “deep belief network” (Section 28.1) depending
on the context. Geoff Hinton (who invented the term “deep belief network”) has suggested the acronyms DyBN and
DeeBN to avoid this ambiguity.

Figure 17.20 The BATnet DBN. The transient nodes are only shown for the second slice, to minimize
clutter. The dotted lines can be ignored. Used with kind permission of Daphne Koller.

Defining DBNs is straightforward: you just need to specify the structure of the first time-slice, 
the structure between two time-slices, and the form of the CPDs. Learning is also easy. The 
main problem is that exact inference can be computationally expensive, because all the hidden 
variables become correlated over time (this is known as entanglement — see e.g., (Koller and 
Friedman 2009, Sec. 15.2.4) for details). Thus a sparse graph does not necessarily result in 
tractable exact inference. However, later we will see algorithms that can exploit the graph 
structure for efficient approximate inference. 


Exercises 


Exercise 17.1 Derivation of Q function for HMM 
Derive Equation 17.97. 


Exercise 17.2 Two filter approach to smoothing in HMMs 


Assuming that $\Pi_t(i) = p(S_t = i) > 0$ for all i and t, derive a recursive algorithm for updating $r_t(i) =
p(S_t = i \mid x_{t+1:T})$. Hint: it should be very similar to the standard forwards algorithm, but using a time-reversed
transition matrix. Then show how to compute the posterior marginals $\gamma_t(i) = p(S_t = i \mid x_{1:T})$
from the backwards filtered messages $r_t(i)$, the forwards filtered messages $\alpha_t(i)$, and the stationary
distribution $\Pi_t(i)$.


Exercise 17.3 EM for HMMs with mixture of Gaussian observations


Consider an HMM where the observation model has the form


$$p(x_t \mid z_t = j, \theta) = \sum_k w_{jk}\, \mathcal{N}(x_t \mid \mu_{jk}, \Sigma_{jk}) \qquad (17.125)$$


• Draw the DGM.
• Derive the E step.
• Derive the M step.


Exercise 17.4 EM for HMMs with tied mixtures


In many applications, it is common that the observations are high-dimensional vectors (e.g., in speech
recognition, $x_t$ is often a vector of cepstral coefficients and their derivatives, so $x_t \in \mathbb{R}^{39}$), so estimating a
full covariance matrix for KM values (where M is the number of mixture components per hidden state),
as in Exercise 17.3, requires a lot of data. An alternative is to use just M Gaussians, rather than MK
Gaussians, and to let the state influence the mixing weights but not the means and covariances. This is
called a semi-continuous HMM or tied-mixture HMM.


• Draw the corresponding graphical model.
• Derive the E step.
• Derive the M step.


18.1 


State space models 


Introduction 


A state space model or SSM is just like an HMM, except the hidden states are continuous. The 
model can be written in the following generic form: 


$$z_t = g(u_t, z_{t-1}, \epsilon_t) \qquad (18.1)$$
$$y_t = h(z_t, u_t, \delta_t) \qquad (18.2)$$


where $z_t$ is the hidden state, $u_t$ is an optional input or control signal, $y_t$ is the observation, g
is the transition model, h is the observation model, $\epsilon_t$ is the system noise at time t, and $\delta_t$
is the observation noise at time t. We assume that all parameters of the model, $\theta$, are known;
if not, they can be included into the hidden state, as we discuss below.

One of the primary goals in using SSMs is to recursively estimate the belief state, $p(z_t \mid y_{1:t}, u_{1:t}, \theta)$.
(Note: we will often drop the conditioning on u and $\theta$ for brevity.) We will discuss algorithms for
this later in this chapter. We will also discuss how to convert our beliefs about the hidden state
into predictions about future observables by computing the posterior predictive $p(y_{t+1} \mid y_{1:t})$.

An important special case of an SSM is where all the CPDs are linear-Gaussian. In other 
words, we assume 


• The transition model is a linear function

$$z_t = A_t z_{t-1} + B_t u_t + \epsilon_t \qquad (18.3)$$

• The observation model is a linear function

$$y_t = C_t z_t + D_t u_t + \delta_t \qquad (18.4)$$

• The system noise is Gaussian

$$\epsilon_t \sim \mathcal{N}(0, Q_t) \qquad (18.5)$$

• The observation noise is Gaussian

$$\delta_t \sim \mathcal{N}(0, R_t) \qquad (18.6)$$


This model is called a linear-Gaussian SSM (LG-SSM) or a linear dynamical system (LDS).
If the parameters $\theta_t = (A_t, B_t, C_t, D_t, Q_t, R_t)$ are independent of time, the model is called
stationary.
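
The generative process defined by Equations 18.3 to 18.6 can be sketched in a few lines for the scalar, stationary case. The function name, the parameter values, and the use of Python's standard `random` module are illustrative assumptions; a vector-valued model would replace the scalars with matrices.

```python
import random

def simulate_lg_ssm(T, A, B, C, D, Q, R, z0, u, seed=0):
    """Sample T steps from a scalar, stationary LG-SSM.

    z_t = A z_{t-1} + B u_t + eps_t,   eps_t ~ N(0, Q)
    y_t = C z_t     + D u_t + delta_t, delta_t ~ N(0, R)
    """
    rng = random.Random(seed)
    z, zs, ys = z0, [], []
    for t in range(T):
        z = A * z + B * u[t] + rng.gauss(0.0, Q ** 0.5)  # system noise
        y = C * z + D * u[t] + rng.gauss(0.0, R ** 0.5)  # observation noise
        zs.append(z)
        ys.append(y)
    return zs, ys

# Noise-free run (Q = R = 0) with a constant input, so the output is deterministic.
zs, ys = simulate_lg_ssm(T=3, A=0.5, B=1.0, C=2.0, D=0.0, Q=0.0, R=0.0,
                         z0=0.0, u=[1.0, 1.0, 1.0])
print(zs, ys)
```

Setting Q and R to positive values turns the same loop into a sampler for noisy trajectories.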


18.2
Figure 18.1 Illustration of Kalman filtering and smoothing. (a) Observations (green circles) are generated
by an object moving to the right (true location denoted by black squares). (b) Filtered estimate is shown
by dotted red line. Red cross is the posterior mean, blue circles are 95% confidence ellipses derived from
the posterior covariance. For clarity, we only plot the ellipses every other time step. (c) Same as (b), but
using offline Kalman smoothing. Figure generated by kalmanTrackingDemo.


The LG-SSM is important because it supports exact inference, as we will see. In particular,
if the initial belief state is Gaussian, $p(z_1) = \mathcal{N}(\mu_{1|0}, \Sigma_{1|0})$, then all subsequent belief states
will also be Gaussian; we will denote them by $p(z_t \mid y_{1:t}) = \mathcal{N}(\mu_{t|t}, \Sigma_{t|t})$. (The notation $\mu_{t|\tau}$
denotes $E[z_t \mid y_{1:\tau}]$, and similarly for $\Sigma_{t|\tau}$; thus $\mu_{1|0}$ denotes the prior for $z_1$ before we have
seen any data. For brevity we will denote the posterior belief states using $\mu_{t|t} = \mu_t$ and
$\Sigma_{t|t} = \Sigma_t$.) We can compute these quantities efficiently using the celebrated Kalman filter,
as we show in Section 18.3.1. But before discussing algorithms, we discuss some important
applications.


Applications of SSMs 


SSMs have many applications, some of which we discuss in the sections below. We mostly 
focus on LG-SSMs, for simplicity, although non-linear and/or non-Gaussian SSMs are even more 
widely used. 


SSMs for object tracking 


One of the earliest applications of Kalman filtering was for tracking objects, such as airplanes 
and missiles, from noisy measurements, such as radar. Here we give a simplified example to 
illustrate the key ideas. Consider an object moving in a 2D plane. Let $z_{1t}$ and $z_{2t}$ be the
horizontal and vertical locations of the object, and $\dot{z}_{1t}$ and $\dot{z}_{2t}$ be the corresponding velocities.
We can represent this as a state vector $z_t \in \mathbb{R}^4$ as follows:


$$z_t^T = \begin{pmatrix} z_{1t} & z_{2t} & \dot{z}_{1t} & \dot{z}_{2t} \end{pmatrix} \qquad (18.7)$$

Let us assume that the object is moving at constant velocity, but is “perturbed” by random
Gaussian noise (e.g., due to the wind). Thus we can model the system dynamics as follows:


$$z_t = A_t z_{t-1} + \epsilon_t \qquad (18.8)$$

$$\begin{pmatrix} z_{1t} \\ z_{2t} \\ \dot{z}_{1t} \\ \dot{z}_{2t} \end{pmatrix} = \begin{pmatrix} 1 & 0 & \Delta & 0 \\ 0 & 1 & 0 & \Delta \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} z_{1,t-1} \\ z_{2,t-1} \\ \dot{z}_{1,t-1} \\ \dot{z}_{2,t-1} \end{pmatrix} + \begin{pmatrix} \epsilon_{1t} \\ \epsilon_{2t} \\ \epsilon_{3t} \\ \epsilon_{4t} \end{pmatrix} \qquad (18.9)$$


where $\epsilon_t \sim \mathcal{N}(0, Q)$ is the system noise, and $\Delta$ is the sampling period. This says that the
new location $z_{j,t}$ is the old location $z_{j,t-1}$ plus $\Delta$ times the old velocity $\dot{z}_{j,t-1}$, plus random
noise, $\epsilon_{jt}$, for j = 1:2. Also, the new velocity $\dot{z}_{j,t}$ is the old velocity $\dot{z}_{j,t-1}$ plus random
noise, $\epsilon_{j+2,t}$, for j = 1:2. This is called a random accelerations model, since the object moves
according to Newton's laws, but is subject to random changes in velocity.
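
The dynamics in Equation 18.9 can be sketched directly with plain lists. The helper names, the sampling period, and the starting state below are illustrative assumptions; the noise is set to zero so that the effect of the transition matrix alone is visible.

```python
def transition_matrix(delta):
    """Constant-velocity transition matrix A for state (z1, z2, z1_dot, z2_dot)."""
    return [
        [1.0, 0.0, delta, 0.0],
        [0.0, 1.0, 0.0, delta],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]

def step(z, delta, noise=(0.0, 0.0, 0.0, 0.0)):
    """One step of z_t = A z_{t-1} + eps_t, with eps_t supplied by the caller."""
    A = transition_matrix(delta)
    return [sum(A[i][k] * z[k] for k in range(4)) + noise[i] for i in range(4)]

# Hypothetical start: at the origin, moving with velocity (1, 2), sampled every 0.5 time units.
z0 = [0.0, 0.0, 1.0, 2.0]
z1 = step(z0, delta=0.5)
z2 = step(z1, delta=0.5)
print(z1, z2)
```

With zero noise the position advances by Δ times the velocity at each step and the velocity stays fixed; passing a non-zero `noise` vector perturbs both.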

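Now suppose that we can observe the location of the object but not its velocity. Let $y_t \in \mathbb{R}^2$
represent our observation, which we assume is subject to Gaussian noise. We can model this as
follows:


$$y_t = C_t z_t + \delta_t \qquad (18.10)$$

$$\begin{pmatrix} y_{1t} \\ y_{2t} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} z_{1t} \\ z_{2t} \\ \dot{z}_{1t} \\ \dot{z}_{2t} \end{pmatrix} + \begin{pmatrix} \delta_{1t} \\ \delta_{2t} \end{pmatrix} \qquad (18.11)$$


where $\delta_t \sim \mathcal{N}(0, R)$ is the measurement noise.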

Finally, we need to specify our initial (prior) beliefs about the state of the object, $p(z_1)$. We
will assume this is a Gaussian, $p(z_1) = \mathcal{N}(z_1 \mid \mu_{1|0}, \Sigma_{1|0})$. We can represent prior ignorance by
making $\Sigma_{1|0}$ suitably “broad”, e.g., $\Sigma_{1|0} = \infty I$. We have now fully specified the model and can
perform sequential Bayesian updating to compute $p(z_t \mid y_{1:t})$ using an algorithm known as the
Kalman filter, to be described in Section 18.3.1. 

Figure 18.1(a) gives an example. The object moves to the right and generates an observation 
at each time step (think of “blips” on a radar screen). We observe these blips and filter out 
the noise by using the Kalman filter. At every step, we have $p(z_t \mid y_{1:t})$, from which we can
compute $p(z_{1t}, z_{2t} \mid y_{1:t})$ by marginalizing out the dimensions corresponding to the velocities.
(This is easy to do since the posterior is Gaussian.) Our “best guess” about the location of the
object is the posterior mean, $E[z_t \mid y_{1:t}]$, denoted as a red cross in Figure 18.1(b). Our uncertainty
associated with this is represented as an ellipse, which contains 95% of the probability mass. We
see that our uncertainty goes down over time, as the effects of the initial uncertainty get “washed
out”. We also see that the estimated trajectory has “filtered out” some of the noise. To obtain
the much smoother plot in Figure 18.1(c), we need to use the Kalman smoother, which computes
$p(z_t \mid y_{1:T})$; this depends on “future” as well as “past” data, as discussed in Section 18.3.2.


Robotic SLAM 


Consider a robot moving around an unknown 2d world. It needs to learn a map and keep
track of its location within that map. This problem is known as simultaneous localization and
mapping, or SLAM for short, and is widely used in mobile robotics, as well as other applications
such as indoor navigation using cellphones (since GPS does not work inside buildings).

Figure 18.2 Illustration of graphical model underlying SLAM. $L^i$ is the fixed location of landmark i, $x_t$
is the location of the robot, and $y_t$ is the observation. In this trace, the robot sees landmarks 1 and 2 at
time step 1, then just landmark 2, then just landmark 1, etc. Based on Figure 15.A.3 of (Koller and Friedman
2009).

Figure 18.3 Illustration of the SLAM problem. (a) A robot starts at the top left and moves clockwise in a
circle back to where it started. We see how the posterior uncertainty about the robot's location increases
and then decreases as it returns to a familiar location, closing the loop. If we performed smoothing, this
new information would propagate backwards in time to disambiguate the entire trajectory. (b) We show the
precision matrix, representing sparse correlations between the landmarks, and between the landmarks and
the robot’s position (pose). This sparse precision matrix can be visualized as a Gaussian graphical model,
as shown. Source: Figure 15.A.3 of (Koller and Friedman 2009). Used with kind permission of Daphne
Koller.

Let us assume we can represent the map as the 2d locations of a fixed set of K landmarks,
denoted by $L^1, \ldots, L^K$ (each is a vector in $\mathbb{R}^2$). For simplicity, we will assume these are
uniquely identifiable. Let $x_t$ represent the unknown location of the robot at time t. We define
the state space to be $z_t = (x_t, L^{1:K})$; we assume the landmarks are static, so their motion
model is a constant, and they have no system noise. If $y_t$ measures the distance from $x_t$ to
the set of closest landmarks, then the robot can update its estimate of the landmark locations
based on what it sees. Figure 18.2 shows the corresponding graphical model for the case where
K = 2, and where on the first step it sees landmarks 1 and 2, then just landmark 2, then just
landmark 1, etc.

If we assume the observation model $p(y_t \mid z_t, L)$ is linear-Gaussian, and we use a Gaussian
motion model for $p(x_t \mid x_{t-1}, u_t)$, we can use a Kalman filter to maintain our belief state about
the location of the robot and the location of the landmarks (Smith and Cheeseman 1986; Choset
and Nagatani 2001). 

Over time, the uncertainty in the robot's location will increase, due to wheel slippage etc., 
but when the robot returns to a familiar location, its uncertainty will decrease again. This is 
called closing the loop, and is illustrated in Figure 18.3(a), where we see the uncertainty ellipses, 
representing $\mathrm{cov}[x_t \mid y_{1:t}, u_{1:t}]$, grow and then shrink. (Note that in this section, we assume that
a human is joysticking the robot through the environment, so $u_{1:t}$ is given as input, i.e., we do
not address the decision-theoretic issue of choosing where to explore.)

Since the belief state is Gaussian, we can visualize the posterior covariance matrix $\Sigma_t$. Actually,
it is more interesting to visualize the posterior precision matrix, $\Lambda_t = \Sigma_t^{-1}$, since that
is fairly sparse, as shown in Figure 18.3(b). The reason for this is that zeros in the precision
matrix correspond to absent edges in the corresponding undirected Gaussian graphical model
(see Section 19.4.4). Initially all the landmarks are uncorrelated (assuming we have a diagonal
prior on L), so the GGM is a disconnected graph, and $\Lambda_t$ is diagonal. However, as the robot
moves about, it will induce correlation between nearby landmarks. Intuitively this is because the
robot is estimating its position based on distance to the landmarks, but the landmarks’ locations
are being estimated based on the robot’s position, so they all become inter-dependent. This can
be seen more clearly from the graphical model in Figure 18.2: it is clear that $L^1$ and $L^2$ are not
d-separated by $y_{1:t}$, because there is a path between them via the unknown sequence of $x_{1:t}$
nodes. As a consequence of the precision matrix becoming denser, exact inference takes $O(K^3)$
time. (This is an example of the entanglement problem for inference in DBNs.) This prevents
the method from being applied to large maps.

There are two main solutions to this problem. The first is to notice that the correlation pattern 
moves along with the location of the robot (see Figure 18.3(b)). The remaining correlations 
become weaker over time. Consequently we can dynamically “prune out” weak edges from 
the GGM using a technique called the thin junction tree filter (Paskin 2003) (junction trees are 
explained in Section 20.4). 

A second approach is to notice that, conditional on knowing the robot’s path, $x_{1:t}$, the
landmark locations are independent. That is, $p(L \mid x_{1:t}, y_{1:t}) = \prod_{k=1}^{K} p(L^k \mid x_{1:t}, y_{1:t})$. This
forms the basis of a method known as FastSLAM, which combines Kalman filtering and particle
filtering, as discussed in Section 23.6.3.

(Thrun et al. 2006) provides a more detailed account of SLAM and mobile robotics.

Figure 18.4 (a) A dynamic generalization of linear regression. (b) Illustration of the recursive least squares
algorithm applied to the model $p(y \mid x, \theta) = \mathcal{N}(y \mid w_0 + w_1 x, \sigma^2)$. We plot the marginal posterior of $w_0$
and $w_1$ vs number of data points. (Error bars represent $E[w_j \mid y_{1:t}] \pm \sqrt{\mathrm{var}[w_j \mid y_{1:t}]}$.) After seeing all
the data, we converge to the offline ML (least squares) solution, represented by the horizontal lines. Figure
generated by linregOnlineDemoKalman.


Online parameter learning using recursive least squares 


We can perform online Bayesian inference for the parameters of various statistical models using 
SSMs. In this section, we focus on linear regression; in Section 18.5.3.2, we discuss logistic 
regression. 

The basic idea is to let the hidden state represent the regression parameters, and to let the
(time-varying) observation model represent the current data vector. In more detail, define the
prior to be $p(\theta) = \mathcal{N}(\theta \mid \theta_0, \Sigma_0)$. (If we want to do online ML estimation, we can just set
$\Sigma_0 = \infty I$.) Let the hidden state be $z_t = \theta$; if we assume the regression parameters do not
change, we can set $A_t = I$ and $Q_t = 0 I$, so


$$p(\theta_t \mid \theta_{t-1}) = \mathcal{N}(\theta_t \mid \theta_{t-1}, 0 I) = \delta_{\theta_{t-1}}(\theta_t) \qquad (18.12)$$


(If we do let the parameters change over time, we get a so-called dynamic linear model
(Harvey 1990; West and Harrison 1997; Petris et al. 2009).) Let $C_t = x_t^T$, and $R_t = \sigma^2$, so the
(non-stationary) observation model has the form


$$\mathcal{N}(y_t \mid C_t z_t, R_t) = \mathcal{N}(y_t \mid x_t^T \theta_t, \sigma^2) \qquad (18.13)$$


Applying the Kalman filter to this model provides a way to update our posterior beliefs about 
the parameters as the data streams in. This is known as the recursive least squares or RLS 
algorithm. 

We can derive an explicit form for the updates as follows. In Section 18.3.1, we show that the
Kalman update for the posterior mean has the form


$$\mu_t = A_t \mu_{t-1} + K_t (y_t - C_t A_t \mu_{t-1}) \qquad (18.14)$$

where $K_t$ is known as the Kalman gain matrix. Based on Equation 18.39, one can show that
$K_t = \Sigma_t C_t^T R_t^{-1}$. In this context, we have $K_t = \Sigma_t x_t / \sigma^2$. Hence the update for the
parameters becomes


$$\hat{\theta}_t = \hat{\theta}_{t-1} + \frac{1}{\sigma^2} \Sigma_{t|t} (y_t - x_t^T \hat{\theta}_{t-1}) x_t \qquad (18.15)$$


If we approximate $\frac{1}{\sigma^2} \Sigma_{t|t-1}$ with $\eta_t I$, we recover the least mean squares or LMS algorithm,
discussed in Section 8.5.3. In LMS, we need to specify how to adapt the update parameter
$\eta_t$ to ensure convergence to the MLE. Furthermore, the algorithm may take multiple passes
through the data. By contrast, the RLS algorithm automatically performs step-size adaptation,
and converges to the optimal posterior in one pass over the data. See Figure 18.4 for an example.
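
The update above can be sketched as a small routine. This is a minimal illustration rather than the book's linregOnlineDemoKalman: it uses the standard Kalman form of the gain, $K_t = \Sigma_{t|t-1} x_t / (x_t^T \Sigma_{t|t-1} x_t + \sigma^2)$, which is algebraically equal to $\Sigma_{t|t} x_t / \sigma^2$; the function name, the broad prior variance of 1e6, and the noise-free synthetic data are assumptions made for the example.

```python
def rls_update(theta, Sigma, x, y, sigma2):
    """One recursive least squares (Kalman) update for y = x^T theta + noise."""
    n = len(theta)
    Sx = [sum(Sigma[i][k] * x[k] for k in range(n)) for i in range(n)]  # Sigma x
    s = sum(x[i] * Sx[i] for i in range(n)) + sigma2                    # innovation variance
    K = [v / s for v in Sx]                                             # Kalman gain
    resid = y - sum(x[i] * theta[i] for i in range(n))                  # prediction error
    theta_new = [theta[i] + K[i] * resid for i in range(n)]
    # Sigma_new = Sigma - K x^T Sigma (Sigma is symmetric, so x^T Sigma = (Sigma x)^T)
    Sigma_new = [[Sigma[i][k] - K[i] * Sx[k] for k in range(n)] for i in range(n)]
    return theta_new, Sigma_new

# Hypothetical data from y = 1 + 2 x with no noise; features are (1, x).
theta = [0.0, 0.0]
Sigma = [[1e6, 0.0], [0.0, 1e6]]  # broad prior standing in for "infinite" variance
for t in range(10):
    x_t = [1.0, float(t)]
    y_t = 1.0 + 2.0 * t
    theta, Sigma = rls_update(theta, Sigma, x_t, y_t, sigma2=1.0)
print(theta)
```

After a single pass the estimate is very close to the least squares solution (1, 2), and the shrinking entries of `Sigma` play the role of the automatically adapted step size.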


SSM for time series forecasting * 


SSMs are very well suited for time-series forecasting, as we explain below. We focus on the case 
of scalar (one dimensional) time series, for simplicity. Our presentation is based on (Varian 2011). 
See also (Aoki 1987; Harvey 1990; West and Harrison 1997; Durbin and Koopman 2001; Petris 
et al. 2009; Prado and West 2010) for good books on this topic. 

At first sight, it might not be apparent why SSMs are useful, since the goal in forecasting is 
to predict future visible variables, not to estimate hidden states of some system. Indeed, most 
classical methods for time series forecasting are just functions of the form $\hat{y}_{t+1} = f(y_{1:t}, \theta)$,
where hidden variables play no role (see Section 18.2.4.4). The idea in the state-space approach to 
time series is to create a generative model of the data in terms of latent processes, which capture 
different aspects of the signal. We can then integrate out the hidden variables to compute the 
posterior predictive of the visibles. 

Since the model is linear-Gaussian, we can just add these processes together to explain the
observed data. This is called a structural time series model. Below we explain some of the
basic building blocks.


Local level model 


The simplest latent process is known as the local level model, which has the form 


$$y_t = a_t + \epsilon_t^y, \quad \epsilon_t^y \sim \mathcal{N}(0, R) \qquad (18.16)$$
$$a_t = a_{t-1} + \epsilon_t^a, \quad \epsilon_t^a \sim \mathcal{N}(0, Q) \qquad (18.17)$$


where the hidden state is just $z_t = a_t$. This model asserts that the observed data $y_t \in \mathbb{R}$ is
equal to some unknown level term $a_t \in \mathbb{R}$, plus observation noise with variance R. In addition,
the level $a_t$ evolves over time subject to system noise with variance Q. See Figure 18.5 for some
examples.

Figure 18.5 (a) Local level model. (b) Sample output, for $a_0 = 10$. Black solid line: Q = 0, R = 1
(deterministic system, noisy observations). Red dotted line: Q = 0.1, R = 0 (noisy system, deterministic
observation). Blue dot-dash line: Q = 0.1, R = 1 (noisy system and observations). Figure generated by
ssmTimeSeriesSimple.
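
The local level model of Equations 18.16 and 18.17 is simple enough to sample in a few lines. The sketch below is an illustrative stand-in, not the book's ssmTimeSeriesSimple script; the function name and seed are assumptions, while the settings of $a_0$, Q and R follow the caption of Figure 18.5.

```python
import random

def simulate_local_level(T, a0, Q, R, seed=0):
    """Sample y_1..y_T from the local level model:
    a_t = a_{t-1} + N(0, Q) noise,  y_t = a_t + N(0, R) noise."""
    rng = random.Random(seed)
    a, ys = a0, []
    for _ in range(T):
        a = a + rng.gauss(0.0, Q ** 0.5)           # level drifts with variance Q
        ys.append(a + rng.gauss(0.0, R ** 0.5))    # observed with variance R
    return ys

flat = simulate_local_level(T=5, a0=10.0, Q=0.0, R=0.0)        # no noise at all
noisy_obs = simulate_local_level(T=2000, a0=10.0, Q=0.0, R=1.0)  # fixed level, noisy observations
drifting = simulate_local_level(T=100, a0=10.0, Q=0.1, R=1.0)    # noisy level and observations
print(flat, sum(noisy_obs) / len(noisy_obs))
```

With Q = 0 the level never moves, so the sample mean of the observations stays near $a_0$; with Q > 0 the level performs a random walk.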


Figure 18.6 (a) Local Trend. (b) Sample output, for $a_0 = 10$, $b_0 = 1$. Color code as in Figure 18.5. Figure
generated by ssmTimeSeriesSimple.


Local linear trend 


Many time series exhibit linear trends upwards or downwards, at least locally. We can model 
this by letting the level $a_t$ change by an amount $b_t$ at each step as follows:


$$y_t = a_t + \epsilon_t^y, \quad \epsilon_t^y \sim \mathcal{N}(0, R) \qquad (18.18)$$
$$a_t = a_{t-1} + b_{t-1} + \epsilon_t^a, \quad \epsilon_t^a \sim \mathcal{N}(0, Q_a) \qquad (18.19)$$
$$b_t = b_{t-1} + \epsilon_t^b, \quad \epsilon_t^b \sim \mathcal{N}(0, Q_b) \qquad (18.20)$$


See Figure 18.6(a). We can write this in standard form by defining $z_t = (a_t, b_t)$ and


$$A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \quad C = \begin{pmatrix} 1 & 0 \end{pmatrix}, \quad Q = \begin{pmatrix} Q_a & 0 \\ 0 & Q_b \end{pmatrix} \qquad (18.21)$$


Figure 18.7 (a) Seasonal model. (b) Sample output, for $a_0 = b_0 = 0$, $c_0 = (1, 1, 1)$, with a period of 4.
Color code as in Figure 18.5. Figure generated by ssmTimeSeriesSimple.


When $Q_b = 0$, we have $b_t = b_0$, which is some constant defining the slope of the line. If in
addition we have $Q_a = 0$, we have $a_t = a_{t-1} + b_0$. Unrolling this, we have $a_t = a_0 + b_0 t$, and
hence $E[y_t \mid y_{1:t-1}] = a_0 + t b_0$. This is thus a generalization of the classic constant linear trend
model, an example of which is shown in the black line of Figure 18.6(b).
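
The noise-free case can be checked by iterating the standard-form matrices of Equation 18.21 directly. The sketch below, with an illustrative function name and the starting values $a_0 = 10$, $b_0 = 1$ from Figure 18.6, computes the expected observations when $Q_a = Q_b = 0$.

```python
def local_linear_trend_mean(a0, b0, T):
    """Iterate z_t = A z_{t-1}, y_t = C z_t for the local linear trend model
    with all noise variances set to zero."""
    A = [[1.0, 1.0],
         [0.0, 1.0]]
    C = [1.0, 0.0]
    z = [a0, b0]  # state is (level a, slope b)
    ys = []
    for _ in range(T):
        z = [A[0][0] * z[0] + A[0][1] * z[1],
             A[1][0] * z[0] + A[1][1] * z[1]]
        ys.append(C[0] * z[0] + C[1] * z[1])
    return ys

ys = local_linear_trend_mean(a0=10.0, b0=1.0, T=5)
print(ys)
```

Each output equals $a_0 + b_0 t$ for t = 1, 2, ..., which is the constant linear trend described in the text.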


Seasonality 


Many time series fluctuate periodically, as illustrated in Figure 18.7(b). This can be modeled by 
adding a latent process consisting of a series of offset terms, $c_t$, which sum to zero (on average)
over a complete cycle of S steps:


$$c_t = -\sum_{s=1}^{S-1} c_{t-s} + \epsilon_t^c, \quad \epsilon_t^c \sim \mathcal{N}(0, Q_c) \qquad (18.22)$$


See Figure 18.7(a) for the graphical model for the case S = 4 (we only need 3 seasonal variables
because of the sum-to-zero constraint). Writing this in standard LG-SSM form is left to
Exercise 18.2. 
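
To see how the sum-to-zero recursion of Equation 18.22 produces a periodic pattern, the following sketch runs it with the noise switched off ($Q_c = 0$). The function name is an illustrative assumption; the initial offsets (1, 1, 1) and the period S = 4 follow the caption of Figure 18.7.

```python
def seasonal_offsets(c_init, T):
    """Generate T seasonal offsets with Q_c = 0.

    c_init holds the S-1 most recent offsets (oldest first); each new offset
    is minus the sum of the previous S-1 offsets.
    """
    window = list(c_init)
    out = []
    for _ in range(T):
        c = -sum(window)
        out.append(c)
        window = window[1:] + [c]  # slide the window of the last S-1 offsets
    return out

offsets = seasonal_offsets((1, 1, 1), T=8)  # S = 4, so the window has 3 entries
print(offsets)
```

In the noise-free case every run of S consecutive offsets sums to exactly zero, and the sequence repeats with period S.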


ARMA models * 


The classical approach to time-series forecasting is based on ARMA models. “ARMA” stands for 
auto-regressive moving-average, and refers to a model of the form 


p q 
te =X aisti +Y bjw + v1 (18.23) 
i=1 j=1 


where $v_t, w_t \sim \mathcal{N}(0, 1)$ are independent Gaussian noise terms. If $q = 0$, we have a pure AR
model, where $x_t \perp x_i \mid x_{t-1:t-p}$, for $i < t - p$. For example, if $p = 1$, we have the AR(1) model
shown in Figure 18.8(a). (The $v_t$ nodes are implicit in the Gaussian CPD for $x_t$.) This is just a
first-order Markov chain. If $p = 0$, we have a pure MA model, where $x_t \perp x_i$, for $i < t - q$.
For example, if $q = 1$, we have the MA(1) model shown in Figure 18.8(b). Here the $w_t$ nodes are
hidden common causes, which induce dependencies between adjacent time steps. This models
short-range correlation. If $p = q = 1$, we get the ARMA(1,1) model shown in Figure 18.8(c), which
captures correlation at short and long time scales.


Figure 18.8 (a) An AR(1) model. (b) An MA(1) model represented as a bi-directed graph. (c) An ARMA(1,1)
model. Source: Figure 5.14 of (Choi 2011). Used with kind permission of Myung Choi.
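
To make Equation 18.23 concrete, the sketch below evaluates the ARMA recursion for explicitly supplied noise sequences. The function name is hypothetical, and treating terms before the start of the series as zero is an assumption of this sketch rather than something the text specifies.

```python
def arma(alpha, beta, v, w):
    """Evaluate Equation 18.23 given explicit noise sequences v and w.

    Terms that would index before the start of the series are treated as zero.
    """
    x = []
    for t in range(len(v)):
        ar = sum(a * x[t - i] for i, a in enumerate(alpha, start=1) if t - i >= 0)
        ma = sum(b * w[t - j] for j, b in enumerate(beta, start=1) if t - j >= 0)
        x.append(ar + ma + v[t])
    return x

# AR(1): a single shock in v decays geometrically.
print(arma([0.5], [], v=[1.0, 0.0, 0.0, 0.0], w=[0.0] * 4))
# MA(1): a single shock in w affects only the following step.
print(arma([], [1.0], v=[0.0, 0.0, 0.0], w=[1.0, 0.0, 0.0]))
```

The AR(1) response persists (with geometric decay), whereas the MA(1) response vanishes after one step, which matches the long-range versus short-range correlation described above.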

It turns out that ARMA models can be represented as SSMs, as explained in (Aoki 1987; Harvey 
1990; West and Harrison 1997; Durbin and Koopman 2001; Petris et al. 2009; Prado and West 
2010). However, the structural approach to time series is often easier to understand than the 
ARMA approach. In addition, it allows the parameters to evolve over time, which makes the 
models more adaptive to non-stationarity. 


Inference in LG-SSM 


In this section, we discuss exact inference in LG-SSM models. We first consider the online case, 
which is analogous to the forwards algorithm for HMMs. We then consider the offline case, 
which is analogous to the forwards-backwards algorithm for HMMs. 


The Kalman filtering algorithm 


The Kalman filter is an algorithm for exact Bayesian filtering for linear-Gaussian state space 
models. We will represent the marginal posterior at time t by 


$$p(z_t | y_{1:t}, u_{1:t}) = \mathcal{N}(z_t | \mu_t, \Sigma_t) \tag{18.24}$$


Since everything is Gaussian, we can perform the prediction and update steps in closed form, 
as we explain below. The resulting algorithm is the Gaussian analog of the HMM filter in 
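Section 17.4.2.


Prediction step


The prediction step is straightforward to derive:


$$p(z_t | y_{1:t-1}, u_{1:t}) = \int \mathcal{N}(z_t | A_t z_{t-1} + B_t u_t, Q_t)\, \mathcal{N}(z_{t-1} | \mu_{t-1}, \Sigma_{t-1})\, dz_{t-1} \tag{18.25}$$
$$= \mathcal{N}(z_t | \mu_{t|t-1}, \Sigma_{t|t-1}) \tag{18.26}$$
$$\mu_{t|t-1} \triangleq A_t \mu_{t-1} + B_t u_t \tag{18.27}$$
$$\Sigma_{t|t-1} \triangleq A_t \Sigma_{t-1} A_t^T + Q_t \tag{18.28}$$
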

Measurement step 
The measurement step can be computed using Bayes rule, as follows:

$$p(z_t | y_t, y_{1:t-1}, u_{1:t}) \propto p(y_t | z_t, u_t)\, p(z_t | y_{1:t-1}, u_{1:t}) \tag{18.29}$$


In Section 18.3.1.6, we show that this is given by 


$$p(z_t | y_{1:t}, u_{1:t}) = \mathcal{N}(z_t | \mu_t, \Sigma_t) \tag{18.30}$$
$$\mu_t = \mu_{t|t-1} + K_t r_t \tag{18.31}$$
$$\Sigma_t = (I - K_t C_t) \Sigma_{t|t-1} \tag{18.32}$$


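where $r_t$ is the residual or innovation, given by the difference between our predicted
observation and the actual observation:


$$r_t \triangleq y_t - \hat{y}_t \tag{18.33}$$
$$\hat{y}_t \triangleq E[y_t | y_{1:t-1}, u_{1:t}] = C_t \mu_{t|t-1} + D_t u_t \tag{18.34}$$


and $K_t$ is the Kalman gain matrix, given by


$$K_t \triangleq \Sigma_{t|t-1} C_t^T S_t^{-1} \tag{18.35}$$

where

$$S_t \triangleq \text{cov}[r_t | y_{1:t-1}, u_{1:t}] \tag{18.36}$$
$$= E[(C_t z_t + \delta_t - \hat{y}_t)(C_t z_t + \delta_t - \hat{y}_t)^T | y_{1:t-1}, u_{1:t}] \tag{18.37}$$
$$= C_t \Sigma_{t|t-1} C_t^T + R_t \tag{18.38}$$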


where $\delta_t \sim \mathcal{N}(0, R_t)$ is an observation noise term which is independent of all other noise
sources. Note that by using the matrix inversion lemma, the Kalman gain matrix can also be
written as


$$K_t = \Sigma_{t|t-1} C^T (C \Sigma_{t|t-1} C^T + R)^{-1} = (\Sigma_{t|t-1}^{-1} + C^T R^{-1} C)^{-1} C^T R^{-1} \tag{18.39}$$


We now have all the quantities we need to implement the algorithm; see kalmanFilter for 
some Matlab code. 

Let us try to make sense of these equations. In particular, consider the equation for the
mean update: $\mu_t = \mu_{t|t-1} + K_t r_t$. This says that the new mean is the old mean plus a
correction factor, which is $K_t$ times the error signal $r_t$. The amount of weight placed on the
error signal depends on the Kalman gain matrix. If $C_t = I$, then $K_t = \Sigma_{t|t-1} S_t^{-1}$, which
is the ratio between the covariance of the prior (from the dynamic model) and the covariance
of the measurement error. If we have a strong prior and/or very noisy sensors, $|K_t|$ will be
small, and we will place little weight on the correction term. Conversely, if we have a weak prior
and/or high precision sensors, then $|K_t|$ will be large, and we will place a lot of weight on the
correction term.
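
The predict and update steps can be sketched for a scalar state and observation as follows. This is an illustrative pure-Python version with no control input, not the kalmanFilter Matlab code; the function name and default parameter values are our own assumptions.

```python
def kalman_step(mu, var, y, a=1.0, c=1.0, q=0.0, r=1.0):
    """One predict/update cycle for a scalar LG-SSM with no control input."""
    # Prediction step (Equations 18.27-18.28)
    mu_pred = a * mu
    var_pred = a * var * a + q
    # Measurement step (Equations 18.31-18.38)
    resid = y - c * mu_pred          # innovation r_t
    s = c * var_pred * c + r         # innovation variance S_t
    k = var_pred * c / s             # Kalman gain K_t
    mu_new = mu_pred + k * resid
    var_new = (1.0 - k * c) * var_pred
    return mu_new, var_new, k

print(kalman_step(mu=0.0, var=1.0, y=2.0))            # equal prior and sensor variance
print(kalman_step(mu=0.0, var=1.0, y=2.0, r=100.0))   # noisy sensor: small gain
```

With equal prior and sensor variance the gain is 0.5, so the posterior mean lies halfway between the prediction and the observation; with a very noisy sensor the gain shrinks and the observation is largely ignored.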


Marginal likelihood 


As a byproduct of the algorithm, we can also compute the log-likelihood of the sequence using 


$$\log p(y_{1:T} | u_{1:T}) = \sum_t \log p(y_t | y_{1:t-1}, u_{1:t}) \tag{18.40}$$

where

$$p(y_t | y_{1:t-1}, u_{1:t}) = \mathcal{N}(y_t | C_t \mu_{t|t-1}, S_t) \tag{18.41}$$
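
As a sketch of how Equation 18.40 accumulates during filtering, the scalar routine below adds the log of the one-step-ahead predictive density at each step. The function name and signature are illustrative assumptions, and control inputs are omitted.

```python
import math

def kalman_loglik(ys, mu0, var0, a=1.0, c=1.0, q=0.0, r=1.0):
    """Accumulate Equation 18.40 for a scalar LG-SSM (no control input)."""
    mu, var, total = mu0, var0, 0.0
    for y in ys:
        mu_pred, var_pred = a * mu, a * var * a + q
        s = c * var_pred * c + r
        resid = y - c * mu_pred
        # log N(y_t | c * mu_pred, s), Equation 18.41
        total += -0.5 * (math.log(2.0 * math.pi * s) + resid ** 2 / s)
        k = var_pred * c / s
        mu, var = mu_pred + k * resid, (1.0 - k * c) * var_pred
    return total

print(kalman_loglik([2.0, 1.5, 1.0], mu0=0.0, var0=1.0))
```

Because the log-likelihood is a by-product of filtering, it can be evaluated for different parameter settings at the cost of one forward pass each.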


Posterior predictive 


The one-step-ahead posterior predictive density for the observations can be computed as follows 


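$$p(y_t | y_{1:t-1}, u_{1:t}) = \int \mathcal{N}(y_t | C z_t, R)\, \mathcal{N}(z_t | \mu_{t|t-1}, \Sigma_{t|t-1})\, dz_t \tag{18.42}$$
$$= \mathcal{N}(y_t | C \mu_{t|t-1}, C \Sigma_{t|t-1} C^T + R) \tag{18.43}$$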


This is useful for time series forecasting. 


Computational issues 


There are two dominant costs in the Kalman filter: the matrix inversion to compute the Kalman
gain matrix, $K_t$, which takes $O(|y_t|^3)$ time; and the matrix-matrix multiply to compute $\Sigma_t$,
which takes $O(|z_t|^2)$ time. In some applications (e.g., robotic mapping), we have $|z_t| \gg |y_t|$, so
the latter cost dominates. However, in such cases, we can sometimes use sparse approximations
(see (Thrun et al. 2006)).

In cases where $|y_t| \gg |z_t|$, we can precompute $K_t$, since, surprisingly, it does not depend on
the actual observations $y_{1:t}$ (an unusual property that is specific to linear Gaussian systems).
The iterative equations for updating $\Sigma_t$ are called the Riccati equations, and for time invariant
systems (i.e., where $\theta_t = \theta$), they converge to a fixed point. This steady state solution can then
be used instead of using a time-specific gain matrix.
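
The convergence of the gain can be illustrated in the scalar, time-invariant case by iterating the covariance recursion until it stops changing. The sketch below is an assumption-laden toy (function name, tolerance and defaults are ours); note that no observations are needed.

```python
def steady_state_gain(a=1.0, c=1.0, q=1.0, r=1.0, var0=1.0, tol=1e-12, max_iter=10_000):
    """Iterate the scalar Riccati recursion until the filtered variance converges."""
    var = var0
    for _ in range(max_iter):
        var_pred = a * var * a + q
        k = var_pred * c / (c * var_pred * c + r)
        var_next = (1.0 - k * c) * var_pred
        if abs(var_next - var) < tol:
            return k, var_next
        var = var_next
    raise RuntimeError("Riccati recursion did not converge")

k, var = steady_state_gain()
print(k, var)
```

For $a = c = q = r = 1$ the fixed point satisfies $\Sigma^2 + \Sigma - 1 = 0$, so both the steady-state variance and gain equal $(\sqrt{5} - 1)/2 \approx 0.618$.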

In practice, more sophisticated implementations of the Kalman filter should be used, for
reasons of numerical stability. One approach is the information filter, which recursively updates
the canonical parameters of the Gaussian, $\Lambda_t = \Sigma_t^{-1}$ and $\eta_t = \Lambda_t \mu_t$, instead of the moment
parameters. Another approach is the square root filter, which works with the Cholesky
decomposition or the $U_t D_t U_t^T$ decomposition of $\Sigma_t$. This is much more numerically stable than
directly updating $\Sigma_t$. Further details can be found at http://www.cs.unc.edu/~welch/kalman/
and in various books, such as (Simon 2006).


Derivation *


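We now derive the Kalman filter equations. For notational simplicity, we will ignore the input
terms $u_{1:t}$. From Bayes rule for Gaussians (Equation 4.125), we have that the posterior precision
is given by


$$\Sigma_t^{-1} = \Sigma_{t|t-1}^{-1} + C_t^T R_t^{-1} C_t \tag{18.44}$$


From the matrix inversion lemma (Equation 4.106) we can rewrite this as

$$\Sigma_t = \Sigma_{t|t-1} - \Sigma_{t|t-1} C_t^T (R_t + C_t \Sigma_{t|t-1} C_t^T)^{-1} C_t \Sigma_{t|t-1} \tag{18.45}$$
$$= (I - K_t C_t) \Sigma_{t|t-1} \tag{18.46}$$

From Bayes rule for Gaussians (Equation 4.125), the posterior mean is given by


$$\mu_t = \Sigma_t C_t^T R_t^{-1} y_t + \Sigma_t \Sigma_{t|t-1}^{-1} \mu_{t|t-1} \tag{18.47}$$


We will now massage this into the form stated earlier. Applying the second matrix inversion
lemma (Equation 4.107) to the first term of Equation 18.47 we have


$$\Sigma_t C_t^T R_t^{-1} y_t = (\Sigma_{t|t-1}^{-1} + C_t^T R_t^{-1} C_t)^{-1} C_t^T R_t^{-1} y_t \tag{18.48}$$
$$= \Sigma_{t|t-1} C_t^T (R_t + C_t \Sigma_{t|t-1} C_t^T)^{-1} y_t = K_t y_t \tag{18.49}$$


Now applying the matrix inversion lemma (Equation 4.106) to the second term of Equation 18.47
we have


$$\Sigma_t \Sigma_{t|t-1}^{-1} \mu_{t|t-1} \tag{18.50}$$
$$= (\Sigma_{t|t-1}^{-1} + C_t^T R_t^{-1} C_t)^{-1} \Sigma_{t|t-1}^{-1} \mu_{t|t-1} \tag{18.51}$$
$$= \left[ \Sigma_{t|t-1} - \Sigma_{t|t-1} C_t^T (R_t + C_t \Sigma_{t|t-1} C_t^T)^{-1} C_t \Sigma_{t|t-1} \right] \Sigma_{t|t-1}^{-1} \mu_{t|t-1} \tag{18.52}$$
$$= (\Sigma_{t|t-1} - K_t C_t \Sigma_{t|t-1}) \Sigma_{t|t-1}^{-1} \mu_{t|t-1} \tag{18.53}$$
$$= \mu_{t|t-1} - K_t C_t \mu_{t|t-1} \tag{18.54}$$


Putting the two together we get


$$\mu_t = \mu_{t|t-1} + K_t (y_t - C_t \mu_{t|t-1}) \tag{18.55}$$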
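
The equivalence established by this derivation can be checked numerically in the scalar case. The sketch below (with hypothetical function names and arbitrary test values) computes the update once in the gain form of Equations 18.46 and 18.55 and once in the precision form of Equations 18.44 and 18.47.

```python
def update_moment_form(mu_pred, var_pred, y, c, r):
    """Gain form: Equations 18.46 and 18.55."""
    k = var_pred * c / (c * var_pred * c + r)
    return mu_pred + k * (y - c * mu_pred), (1.0 - k * c) * var_pred

def update_information_form(mu_pred, var_pred, y, c, r):
    """Precision form: Equations 18.44 and 18.47."""
    var = 1.0 / (1.0 / var_pred + c * c / r)
    mu = var * (c * y / r) + var * (mu_pred / var_pred)
    return mu, var

print(update_moment_form(1.0, 2.0, 3.0, c=0.5, r=0.25))
print(update_information_form(1.0, 2.0, 3.0, c=0.5, r=0.25))
```

Both calls should agree up to floating point rounding.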


The Kalman smoothing algorithm 


In Section 18.3.1, we described the Kalman filter, which sequentially computes $p(z_t | y_{1:t})$ for each
$t$. This is useful for online inference problems, such as tracking. However, in an offline setting,
we can wait until all the data has arrived, and then compute $p(z_t | y_{1:T})$. By conditioning
on past and future data, our uncertainty will be significantly reduced. This is illustrated in
Figure 18.1(c), where we see that the posterior covariance ellipsoids are smaller for the smoothed
trajectory than for the filtered trajectory. (The ellipsoids are larger at the beginning and end of
the trajectory, since states near the boundary do not have as many useful neighbors from which
to borrow information.)

We now explain how to compute the smoothed estimates, using an algorithm called the
RTS smoother, named after its inventors, Rauch, Tung and Striebel (Rauch et al. 1965). It is
also known as the Kalman smoothing algorithm. The algorithm is analogous to the forwards-backwards
algorithm for HMMs, although there are some small differences which we discuss
below.


Algorithm 


Kalman filtering can be regarded as message passing on a graph, from left to right. When the
messages have reached the end of the graph, we have successfully computed $p(z_T | y_{1:T})$. Now
we work backwards, from right to left, sending information from the future back to the past,
and then combining the two information sources. The question is: how do we compute these
backwards equations? We first give the equations, then the derivation.


We have

$$p(z_t | y_{1:T}) = \mathcal{N}(\mu_{t|T}, \Sigma_{t|T}) \tag{18.56}$$
$$\mu_{t|T} = \mu_{t|t} + J_t (\mu_{t+1|T} - \mu_{t+1|t}) \tag{18.57}$$
$$\Sigma_{t|T} = \Sigma_{t|t} + J_t (\Sigma_{t+1|T} - \Sigma_{t+1|t}) J_t^T \tag{18.58}$$
$$J_t \triangleq \Sigma_{t|t} A_{t+1}^T \Sigma_{t+1|t}^{-1} \tag{18.59}$$


where $J_t$ is the backwards Kalman gain matrix. The algorithm can be initialized from $\mu_{T|T}$
and $\Sigma_{T|T}$ from the Kalman filter. Note that this backwards pass does not need access to the
data, that is, it does not need $y_{1:T}$. This allows us to “throw away” potentially high dimensional
observation vectors, and just keep the filtered belief states, which usually requires less memory.
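
A scalar sketch of the forward filter followed by the backward pass of Equations 18.57 to 18.59 is given below. The function names and the toy data are our own assumptions; the point to notice is that `rts_smoother` takes only the filtered beliefs and the dynamics parameters, never the observations.

```python
def kalman_filter(ys, mu0, var0, a, c, q, r):
    """Scalar Kalman filter; returns the filtered (mean, variance) pairs."""
    mu, var, out = mu0, var0, []
    for y in ys:
        mu_pred, var_pred = a * mu, a * var * a + q
        k = var_pred * c / (c * var_pred * c + r)
        mu, var = mu_pred + k * (y - c * mu_pred), (1.0 - k * c) * var_pred
        out.append((mu, var))
    return out

def rts_smoother(filtered, a, q):
    """Backward pass of Equations 18.57-18.59; uses only the filtered beliefs."""
    smoothed = [filtered[-1]]                        # start from mu_{T|T}, Sigma_{T|T}
    for mu_f, var_f in reversed(filtered[:-1]):
        mu_pred, var_pred = a * mu_f, a * var_f * a + q   # mu_{t+1|t}, Sigma_{t+1|t}
        j = var_f * a / var_pred                          # backwards gain J_t
        mu_next, var_next = smoothed[0]
        mu_s = mu_f + j * (mu_next - mu_pred)
        var_s = var_f + j * (var_next - var_pred) * j
        smoothed.insert(0, (mu_s, var_s))
    return smoothed

filtered = kalman_filter([1.0, 1.2, 0.9, 1.1], 0.0, 10.0, a=1.0, c=1.0, q=0.1, r=1.0)
smoothed = rts_smoother(filtered, a=1.0, q=0.1)
for (mf, vf), (ms, vs) in zip(filtered, smoothed):
    print(f"filtered var {vf:.3f}  smoothed var {vs:.3f}")
```

The smoothed variances are never larger than the filtered ones, and the two coincide at the final time step.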


Derivation * 


We now derive the Kalman smoother, following the presentation of (Jordan 2007, sec 15.7).

The key idea is to leverage the Markov property, which says that $z_t$ is independent of future
data, $y_{t+1:T}$, as long as $z_{t+1}$ is known. Of course, $z_{t+1}$ is not known, but we have a distribution
over it. So we condition on $z_{t+1}$ and then integrate it out, as follows.


$$p(z_t | y_{1:T}) = \int p(z_t | y_{1:T}, z_{t+1})\, p(z_{t+1} | y_{1:T})\, dz_{t+1} \tag{18.60}$$
$$= \int p(z_t | y_{1:t}, \cancel{y_{t+1:T}}, z_{t+1})\, p(z_{t+1} | y_{1:T})\, dz_{t+1} \tag{18.61}$$


By induction, assume we have already computed the smoothed distribution for $t + 1$:


$$p(z_{t+1} | y_{1:T}) = \mathcal{N}(z_{t+1} | \mu_{t+1|T}, \Sigma_{t+1|T}) \tag{18.62}$$


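The question is: how do we perform the integration?
First, we compute the filtered two-slice distribution $p(z_t, z_{t+1} | y_{1:t})$ as follows:


$$p(z_t, z_{t+1} | y_{1:t}) = \mathcal{N}\left( \begin{pmatrix} z_t \\ z_{t+1} \end{pmatrix} \,\middle|\, \begin{pmatrix} \mu_{t|t} \\ \mu_{t+1|t} \end{pmatrix}, \begin{pmatrix} \Sigma_{t|t} & \Sigma_{t|t} A_{t+1}^T \\ A_{t+1} \Sigma_{t|t} & \Sigma_{t+1|t} \end{pmatrix} \right) \tag{18.63}$$

Now we use Gaussian conditioning to compute $p(z_t | z_{t+1}, y_{1:t})$ as follows:


$$p(z_t | z_{t+1}, y_{1:t}) = \mathcal{N}(z_t | \mu_{t|t} + J_t (z_{t+1} - \mu_{t+1|t}), \Sigma_{t|t} - J_t \Sigma_{t+1|t} J_t^T) \tag{18.64}$$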


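We can compute the smoothed distribution for $t$ using the rules of iterated expectation and
iterated covariance. First, the mean:


$$\mu_{t|T} = E\left[ E[z_t | z_{t+1}, y_{1:T}] \,|\, y_{1:T} \right] \tag{18.65}$$
$$= E\left[ E[z_t | z_{t+1}, y_{1:t}] \,|\, y_{1:T} \right] \tag{18.66}$$
$$= E\left[ \mu_{t|t} + J_t (z_{t+1} - \mu_{t+1|t}) \,|\, y_{1:T} \right] \tag{18.67}$$
$$= \mu_{t|t} + J_t (\mu_{t+1|T} - \mu_{t+1|t}) \tag{18.68}$$

Now the covariance:

$$\Sigma_{t|T} = \text{cov}\left[ E[z_t | z_{t+1}, y_{1:T}] \,|\, y_{1:T} \right] + E\left[ \text{cov}[z_t | z_{t+1}, y_{1:T}] \,|\, y_{1:T} \right] \tag{18.69}$$
$$= \text{cov}\left[ E[z_t | z_{t+1}, y_{1:t}] \,|\, y_{1:T} \right] + E\left[ \text{cov}[z_t | z_{t+1}, y_{1:t}] \,|\, y_{1:T} \right] \tag{18.70}$$
$$= \text{cov}\left[ \mu_{t|t} + J_t (z_{t+1} - \mu_{t+1|t}) \,|\, y_{1:T} \right] + E\left[ \Sigma_{t|t} - J_t \Sigma_{t+1|t} J_t^T \,|\, y_{1:T} \right] \tag{18.71}$$
$$= J_t\, \text{cov}\left[ z_{t+1} - \mu_{t+1|t} \,|\, y_{1:T} \right] J_t^T + \Sigma_{t|t} - J_t \Sigma_{t+1|t} J_t^T \tag{18.72}$$
$$= J_t \Sigma_{t+1|T} J_t^T + \Sigma_{t|t} - J_t \Sigma_{t+1|t} J_t^T \tag{18.73}$$
$$= \Sigma_{t|t} + J_t (\Sigma_{t+1|T} - \Sigma_{t+1|t}) J_t^T \tag{18.74}$$


The algorithm can be initialized from $\mu_{T|T}$ and $\Sigma_{T|T}$ from the last step of the filtering
algorithm.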


Comparison to the forwards-backwards algorithm for HMMs * 


Note that in both the forwards and backwards passes for LDS, we always worked with normalized 
distributions, either conditioned on the past data or conditioned on all the data. Furthermore, 
the backwards pass depends on the results of the forwards pass. This is different from the usual 
presentation of forwards-backwards for HMMs, where the backwards pass can be computed 
independently of the forwards pass (see Section 17.4.3). 

It turns out that we can rewrite the Kalman smoother in a modified form which makes it 
more similar to forwards-backwards for HMMs. In particular, we have 


$$p(z_t | y_{1:T}) = \int p(z_t | y_{1:t}, z_{t+1})\, p(z_{t+1} | y_{1:T})\, dz_{t+1} \tag{18.75}$$
$$= \int p(z_t, z_{t+1} | y_{1:t})\, \frac{p(z_{t+1} | y_{1:T})}{p(z_{t+1} | y_{1:t})}\, dz_{t+1} \tag{18.76}$$


Now


$$p(z_{t+1} | y_{1:T}) = \frac{p(y_{t+1:T} | z_{t+1}, \cancel{y_{1:t}})\, p(z_{t+1} | y_{1:t})}{p(y_{t+1:T} | y_{1:t})} \tag{18.77}$$

so

$$\frac{p(z_{t+1} | y_{1:T})}{p(z_{t+1} | y_{1:t})} = \frac{p(z_{t+1} | y_{1:t})\, p(y_{t+1:T} | z_{t+1})}{p(z_{t+1} | y_{1:t})\, p(y_{t+1:T} | y_{1:t})} \propto p(y_{t+1:T} | z_{t+1}) \tag{18.78}$$

which is the conditional likelihood of the future data. This backwards message can be computed
independently of the forwards message. However, this approach has several disadvantages: (1)
it needs access to the original observation sequence; (2) the backwards message is a likelihood,
not a posterior, so it need not integrate to 1 over $z$; in fact, it may not always be possible
to represent $p(y_{t+1:T} | z_{t+1})$ as a Gaussian with positive definite covariance (this problem does
not arise in discrete state-spaces, as used in HMMs); (3) when exact inference is not possible, it
makes more sense to try to approximate the smoothed distribution rather than the backwards
likelihood term (see Section 22.5).

There is yet another variant, known as two-filter smoothing, whereby we compute $p(z_t | y_{1:t})$
in the forwards pass as usual, and the filtered posterior $p(z_t | y_{t+1:T})$ in the backwards pass.
These can then be easily combined to compute $p(z_t | y_{1:T})$. See (Kitagawa 2004; Briers et al.
2010) for details.


Learning for LG-SSM 


In this section, we briefly discuss how to estimate the parameters of an LG-SSM. In the control 
theory community, this is known as systems identification (Ljung 1987). 

When using SSMs for time series forecasting, and also in some physical state estimation 
problems, the observation matrix C and the transition matrix A are both known and fixed, by 
definition of the model. In such cases, all that needs to be learned are the noise covariances Q 
and R. (The initial state estimate $\mu_0$ is often less important, since it will get “washed away” by
the data after a few time steps. This can be encouraged by setting the initial state covariance
to be large, representing a weak prior.) Although we can estimate Q and R offline, using the
methods described below, it is also possible to derive a recursive procedure to exactly compute
the posterior $p(z_t, R, Q | y_{1:t})$, which has the form of a Normal-inverse-Wishart; see (West and
Harrison 1997; Prado and West 2010) for details.


Identifiability and numerical stability 


In the more general setting, where the hidden states have no pre-specified meaning, we need to 
learn A and C. However, in this case we can set Q = I without loss of generality, since an 
arbitrary noise covariance can be modeled by appropriately modifying A. Also, by analogy with 
factor analysis, we can require R to be diagonal without loss of generality. Doing this reduces 
the number of free parameters and improves numerical stability. 

Another constraint that is useful to impose is on the eigenvalues of the dynamics matrix A. 
To see why this is important, consider the case of no system noise. In this case, the hidden 
state at time t is given by 


$$z_t = A^t z_1 = U \Lambda^t U^{-1} z_1 \tag{18.79}$$


where $U$ is the matrix of eigenvectors for $A$, and $\Lambda = \text{diag}(\lambda_i)$ contains the eigenvalues. If
any $|\lambda_i| > 1$, then for large $t$, $z_t$ will blow up in magnitude. Consequently, to ensure stability, it
is useful to require that all the eigenvalues are less than 1 in magnitude (Siddiqi et al. 2007). Of
course, if all the eigenvalues are less than 1 in magnitude, then $E[z_t] = 0$ for large $t$, so the state
will return to the origin. Fortunately, when we add noise, the state becomes non-zero, so the
model does not degenerate.

Below we discuss how to estimate the parameters. However, for simplicity of presentation, we
do not impose any of the constraints mentioned above.
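
Although the estimators below ignore it, the eigenvalue condition on the dynamics matrix is easy to check after fitting. The following sketch handles only the $2 \times 2$ case using the trace and determinant (the helper names are hypothetical; a general implementation would call a numerical eigenvalue routine instead).

```python
import cmath

def eigenvalues_2x2(A):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    (a, b), (c, d) = A
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + disc) / 2.0, (tr - disc) / 2.0

def is_stable(A):
    """True if every eigenvalue has modulus strictly below 1."""
    return all(abs(lam) < 1.0 for lam in eigenvalues_2x2(A))

print(is_stable([[0.5, 0.0], [0.0, 0.9]]))    # True
print(is_stable([[1.1, 0.0], [0.0, 0.5]]))    # False
print(is_stable([[0.0, -0.9], [0.9, 0.0]]))   # True: complex pair with modulus 0.9
```

The third example shows why the condition is stated in terms of the modulus: a damped rotation has complex eigenvalues, yet the state still decays to the origin.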


Training with fully observed data 


If we observe the hidden state sequences, we can fit the model by computing the MLEs (or even
the full posteriors) for the parameters by solving a multivariate linear regression problem for
$z_{t-1} \rightarrow z_t$ and for $z_t \rightarrow y_t$. That is, we can estimate $A$ by solving the least squares problem
$J(A) = \sum_t \|z_t - A z_{t-1}\|^2$, and similarly for $C$. We can estimate the system noise covariance
$Q$ from the residuals in predicting $z_t$ from $z_{t-1}$, and estimate the observation noise covariance
$R$ from the residuals in predicting $y_t$ from $z_t$.
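
In the scalar case the least squares problem has a closed-form solution, which the following sketch illustrates (the function name is our own, and the multivariate case would use a matrix least squares solver instead).

```python
def fit_scalar_dynamics(z):
    """Least-squares estimate of a and residual variance q for z_t = a z_{t-1} + noise."""
    prev, curr = z[:-1], z[1:]
    a_hat = sum(p * c for p, c in zip(prev, curr)) / sum(p * p for p in prev)
    resid = [c - a_hat * p for p, c in zip(prev, curr)]
    q_hat = sum(e * e for e in resid) / len(resid)   # MLE of the noise variance
    return a_hat, q_hat

# A noise-free geometric decay is recovered exactly.
print(fit_scalar_dynamics([1.0, 0.5, 0.25, 0.125]))
```

The same two-line recipe, applied to the pairs $(z_t, y_t)$, gives estimates of $C$ and $R$.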


EM for LG-SSM 


If we only observe the output sequence, we can compute ML or MAP estimates of the parameters 
using EM. The method is conceptually quite similar to the Baum-Welch algorithm for HMMs 
(Section 17.5), except we use Kalman smoothing instead of forwards-backwards in the E step, 
and use different calculations in the M step. We leave the details to Exercise 18.1. 


Subspace methods 


EM does not always give satisfactory results, because it is sensitive to the initial parameter 
estimates. One way to avoid this is to use a different approach known as a subspace method 
(Overschee and Moor 1996; Katayama 2005). 

To understand this approach, let us initially assume there is no observation noise and no
system noise. In this case, we have $z_t = A z_{t-1}$ and $y_t = C z_t$, and hence $y_t = C A^{t-1} z_1$.
Consequently all the observations must be generated from a $\dim(z_t)$-dimensional linear manifold
or subspace. We can identify this subspace using PCA (see the above references for details).
Once we have an estimate of the $z_t$’s, we can fit the model as if it were fully observed. We can
either use these estimates in their own right, or use them to initialize EM.


Bayesian methods for “fitting” LG-SSMs 


There are various offline Bayesian alternatives to the EM algorithm, including variational Bayes 
EM (Beal 2003; Barber and Chiappa 2007) and blocked Gibbs sampling (Carter and Kohn 1994; 
Cappe et al. 2005; Fruhwirth-Schnatter 2007). The Bayesian approach can also be used to 
perform online learning, as we discussed in Section 18.2.3. Unfortunately, once we add the SSM 
parameters to the state space, the model is generally no longer linear Gaussian. Consequently 
we must use some of the approximate online inference methods to be discussed below. 


Approximate online inference for non-linear, non-Gaussian SSMs 


In Section 18.3.1, we discussed how to perform exact online inference for LG-SSMs. However,
many models are nonlinear. For example, most moving objects do not move in straight lines.
And even if they did, if we assume the parameters of the model are unknown and add them
to the state space, the model becomes nonlinear. Furthermore, non-Gaussian noise is also very
common, e.g., due to outliers, or when inferring parameters for GLMs instead of just linear
regression. For these more general models, we need to use approximate inference.

The approximate inference algorithms we discuss below approximate the posterior by a Gaus- 
sian. In general, if Y = f(X), where X has a Gaussian distribution and f is a non-linear 
function, there are two main ways to approximate p(Y) by a Gaussian. The first is to use a 
first-order approximation of f. The second is to use the exact f, but to project f(X) onto the 
space of Gaussians by moment matching. We discuss each of these methods in turn. (See also
Section 23.5, where we discuss particle filtering, a stochastic algorithm for approximate
online inference that uses a non-parametric approximation to the posterior; this is often
more accurate but slower to compute.)


Extended Kalman filter (EKF) 


In this section, we focus on non-linear models, but we assume the noise is Gaussian. That is, 
we consider models of the form 


$$z_t = g(u_t, z_{t-1}) + \mathcal{N}(0, Q_t) \tag{18.80}$$
$$y_t = h(z_t) + \mathcal{N}(0, R_t) \tag{18.81}$$


where the transition model g and the observation model h are nonlinear but differentiable 
functions. Furthermore, we focus on the case where we approximate the posterior by a single
Gaussian. (The simplest way to handle more general posteriors (e.g., multi-modal, discrete, etc.)
is to use particle filtering, which we discuss in Section 23.5.)

The extended Kalman filter or EKF can be applied to nonlinear Gaussian dynamical systems 
of this form. The basic idea is to linearize g and h about the previous state estimate using 
a first order Taylor series expansion, and then to apply the standard Kalman filter equations. 
(The noise variance in the equations (Q and R) is not changed, i.e., the additional error due to 
linearization is not modeled.) Thus we approximate the stationary non-linear dynamical system 
with a non-stationary linear dynamical system. 

The intuition behind the approach is shown in Figure 18.9, which shows what happens when 
we pass a Gaussian distribution p(x), shown on the bottom right, through a nonlinear function 
y = g(x), shown on the top right. The resulting distribution (approximated by Monte Carlo) is 
shown in the shaded gray area in the top left corner. The best Gaussian approximation to this, 
computed from E [g(x)] and var [g(x)] by Monte Carlo, is shown by the solid black line. The 
EKF approximates this Gaussian as follows: it linearizes the $g$ function at the current mode, $\mu$,
and then passes the Gaussian distribution $p(x)$ through this linearized function. In this example,
the result is quite a good approximation to the first and second moments of p(y), for much less 
cost than an MC approximation. 

In more detail, the method works as follows. We approximate the measurement model using


$$p(y_t | z_t) \approx \mathcal{N}(y_t | h(\mu_{t|t-1}) + H_t (z_t - \mu_{t|t-1}), R_t) \tag{18.82}$$

where $H_t$ is the Jacobian matrix of $h$ evaluated at the prior mode:

$$H_{ij} \triangleq \frac{\partial h_i(z)}{\partial z_j} \tag{18.83}$$
$$H_t \triangleq H|_{z = \mu_{t|t-1}} \tag{18.84}$$


Figure 18.9 Nonlinear transformation of a Gaussian random variable. The prior p(x) is shown on the
bottom right. The function y = g(x) is shown on the top right. The transformed distribution p(y) is
shown in the top left. A linear function induces a Gaussian distribution, but a non-linear function induces
a complex distribution. The solid line is the best Gaussian approximation to this; the dotted line is the EKF
approximation to this. Source: Figure 3.4 of (Thrun et al. 2006). Used with kind permission of Sebastian
Thrun.


Similarly, we approximate the system model using


$$p(z_t | z_{t-1}, u_t) \approx \mathcal{N}(z_t | g(u_t, \mu_{t-1}) + G_t (z_{t-1} - \mu_{t-1}), Q_t) \tag{18.85}$$

where

$$G_{ij}(u) \triangleq \frac{\partial g_i(u, z)}{\partial z_j} \tag{18.86}$$
$$G_t \triangleq G(u_t)|_{z = \mu_{t-1}} \tag{18.87}$$

so $G$ is the Jacobian matrix of $g$ evaluated at the prior mode.


Given this, we can then apply the Kalman filter to compute the posterior as follows:


$$\mu_{t|t-1} = g(u_t, \mu_{t-1}) \tag{18.88}$$
$$V_{t|t-1} = G_t V_{t-1} G_t^T + Q_t \tag{18.89}$$
$$K_t = V_{t|t-1} H_t^T (H_t V_{t|t-1} H_t^T + R_t)^{-1} \tag{18.90}$$
$$\mu_t = \mu_{t|t-1} + K_t (y_t - h(\mu_{t|t-1})) \tag{18.91}$$
$$V_t = (I - K_t H_t) V_{t|t-1} \tag{18.92}$$


Figure 18.10 An example of the unscented transform in two dimensions. Source: (Wan and der Merwe
2001). Used with kind permission of Eric Wan.


We see that the only difference from the regular Kalman filter is that, when we compute the
state prediction, we use $g(u_t, \mu_{t-1})$ instead of $A_t \mu_{t-1} + B_t u_t$, and when we compute the
measurement update we use $h(\mu_{t|t-1})$ instead of $C_t \mu_{t|t-1}$.

It is possible to improve performance by repeatedly re-linearizing the equations around $\mu_t$
instead of $\mu_{t|t-1}$; this is called the iterated EKF, and yields better results, although it is of
course slower.
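
A scalar sketch of one EKF cycle is shown below. The derivatives `dg` and `dh` play the role of the Jacobians; the function name, the omission of control inputs, and the example nonlinear model are all illustrative assumptions.

```python
import math

def ekf_step(mu, var, y, g, dg, h, dh, q, r):
    """One scalar EKF cycle; dg and dh are the derivatives of g and h (no control input)."""
    # Linearize g about the previous posterior mean (Equations 18.88-18.89)
    mu_pred = g(mu)
    G = dg(mu)
    var_pred = G * var * G + q
    # Linearize h about the predicted mean (Equations 18.90-18.92)
    H = dh(mu_pred)
    k = var_pred * H / (H * var_pred * H + r)
    mu_new = mu_pred + k * (y - h(mu_pred))
    var_new = (1.0 - k * H) * var_pred
    return mu_new, var_new

# Hypothetical nonlinear model: sinusoidal dynamics, quadratic sensor.
print(ekf_step(0.5, 0.2, y=0.3, g=math.sin, dg=math.cos,
               h=lambda z: z * z, dh=lambda z: 2.0 * z, q=0.01, r=0.1))
# With linear g and h the step reduces to the ordinary Kalman filter.
print(ekf_step(0.0, 1.0, y=2.0, g=lambda z: z, dg=lambda z: 1.0,
               h=lambda z: z, dh=lambda z: 1.0, q=0.0, r=1.0))
```

The second call is a useful sanity check: when $g$ and $h$ are linear, the EKF equations coincide with the Kalman filter equations.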

There are two cases when the EKF works poorly. The first is when the prior covariance is 
large. In this case, the prior distribution is broad, so we end up sending a lot of probability 
mass through different parts of the function that are far from the mean, where the function has 
been linearized. The other setting where the EKF works poorly is when the function is highly 
nonlinear near the current mean. In Section 18.5.2, we will discuss an algorithm called the UKF 
which works better than the EKF in both of these settings. 


Unscented Kalman filter (UKF) 


The unscented Kalman filter (UKF) is a better version of the EKF (Julier and Uhlmann 1997). 
(Apparently it is so called because it “doesn’t stink”.) The key intuition is this: it is easier
to approximate a Gaussian than to approximate a function. So instead of performing a linear
approximation to the function, and passing a Gaussian through it, we pass a deterministically
chosen set of points, known as sigma points, through the function, and fit a Gaussian to the
resulting transformed points. This is known as the unscented transform, and is sketched in
Figure 18.10. (We explain this figure in detail below.)

The UKF basically uses the unscented transform twice, once to approximate passing through
the system model $g$, and once to approximate passing through the measurement model $h$. We
give the details below. Note that the UKF and EKF both perform $O(d^3)$ operations per time step
where $d$ is the size of the latent state-space. However, the UKF is accurate to at least second
order, whereas the EKF is only a first order approximation (although both the EKF and UKF can
be extended to capture higher order terms). Furthermore, the unscented transform does not
require the analytic evaluation of any derivatives or Jacobians (a so-called derivative free filter),
making it simpler to implement and more widely applicable.


The unscented transform 


Before explaining the UKF, we first explain the unscented transform. Assume $p(x) = \mathcal{N}(x | \mu, \Sigma)$,
and consider estimating $p(y)$, where $y = f(x)$ for some nonlinear function $f$. The unscented
transform does this as follows. First we create a set of $2d + 1$ sigma points $x_i$, given by


$$x = \left( \mu, \left\{ \mu + (\sqrt{(d + \lambda)\Sigma})_{:i} \right\}_{i=1}^{d}, \left\{ \mu - (\sqrt{(d + \lambda)\Sigma})_{:i} \right\}_{i=1}^{d} \right) \tag{18.93}$$


where $\lambda = \alpha^2 (d + \kappa) - d$ is a scaling parameter to be specified below, and the notation $M_{:i}$
means the $i$'th column of matrix $M$.

These sigma points are propagated through the nonlinear function to yield $y_i = f(x_i)$, and
the mean and covariance for $y$ is computed as follows:


$$\mu_y = \sum_{i=0}^{2d} w_m^i y_i \tag{18.94}$$
$$\Sigma_y = \sum_{i=0}^{2d} w_c^i (y_i - \mu_y)(y_i - \mu_y)^T \tag{18.95}$$

where the $w$’s are weighting terms, given by

$$w_m^0 = \frac{\lambda}{d + \lambda} \tag{18.96}$$
$$w_c^0 = \frac{\lambda}{d + \lambda} + (1 - \alpha^2 + \beta) \tag{18.97}$$
$$w_m^i = w_c^i = \frac{1}{2(d + \lambda)} \tag{18.98}$$


See Figure 18.10 for an illustration.

In general, the optimal values of $\alpha$, $\beta$ and $\kappa$ are problem dependent, but when $d = 1$, they
are $\alpha = 1$, $\beta = 0$, $\kappa = 2$. Thus in the 1d case, $\lambda = 2$, so the 3 sigma points are $\mu$, $\mu + \sqrt{3}\sigma$
and $\mu - \sqrt{3}\sigma$.
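
The 1d case can be sketched directly from Equations 18.93 to 18.98. The function name below is our own; the default parameter values are the $d = 1$ settings quoted above.

```python
import math

def unscented_transform_1d(f, mu, var, alpha=1.0, beta=0.0, kappa=2.0):
    """Scalar (d = 1) unscented transform following Equations 18.93-18.98."""
    d = 1
    lam = alpha ** 2 * (d + kappa) - d
    spread = math.sqrt((d + lam) * var)
    points = [mu, mu + spread, mu - spread]
    w_m = [lam / (d + lam)] + [1.0 / (2.0 * (d + lam))] * 2
    w_c = [lam / (d + lam) + (1.0 - alpha ** 2 + beta)] + w_m[1:]
    ys = [f(x) for x in points]
    mean = sum(w * y for w, y in zip(w_m, ys))
    cov = sum(w * (y - mean) ** 2 for w, y in zip(w_c, ys))
    return mean, cov

# For x ~ N(0, 1) and f(x) = x^2 the exact moments are E[y] = 1 and var[y] = 2.
print(unscented_transform_1d(lambda x: x * x, mu=0.0, var=1.0))
```

In this example the three sigma points recover both moments of $y = x^2$ up to rounding, whereas linearizing $f$ at $\mu = 0$ (as the EKF would) gives a slope of zero and hence a degenerate approximation with zero variance.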


The UKF algorithm 


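The UKF algorithm is simply two applications of the unscented transform, one to compute
$p(z_t | y_{1:t-1}, u_{1:t})$ and the other to compute $p(z_t | y_{1:t}, u_{1:t})$. We give the details below.

The first step is to approximate the predictive density $p(z_t | y_{1:t-1}, u_{1:t}) \approx \mathcal{N}(z_t | \bar{\mu}_t, \bar{\Sigma}_t)$ by
passing the old belief state $\mathcal{N}(z_{t-1} | \mu_{t-1}, \Sigma_{t-1})$ through the system model $g$ as follows:


$$z^0_{t-1} = \left( \mu_{t-1}, \left\{ \mu_{t-1} + \gamma (\sqrt{\Sigma_{t-1}})_{:i} \right\}_{i=1}^{d}, \left\{ \mu_{t-1} - \gamma (\sqrt{\Sigma_{t-1}})_{:i} \right\}_{i=1}^{d} \right) \tag{18.99}$$
$$\bar{z}^{*i}_t = g(u_t, z^{0i}_{t-1}) \tag{18.100}$$
$$\bar{\mu}_t = \sum_{i=0}^{2d} w_m^i \bar{z}^{*i}_t \tag{18.101}$$
$$\bar{\Sigma}_t = \sum_{i=0}^{2d} w_c^i (\bar{z}^{*i}_t - \bar{\mu}_t)(\bar{z}^{*i}_t - \bar{\mu}_t)^T + Q_t \tag{18.102}$$


where $\gamma = \sqrt{d + \lambda}$.


The second step is to approximate the likelihood $p(y_t | z_t) \approx \mathcal{N}(y_t | \hat{y}_t, S_t)$ by passing the
prior $\mathcal{N}(z_t | \bar{\mu}_t, \bar{\Sigma}_t)$ through the observation model $h$:


$$\bar{z}^0_t = \left( \bar{\mu}_t, \left\{ \bar{\mu}_t + \gamma (\sqrt{\bar{\Sigma}_t})_{:i} \right\}_{i=1}^{d}, \left\{ \bar{\mu}_t - \gamma (\sqrt{\bar{\Sigma}_t})_{:i} \right\}_{i=1}^{d} \right) \tag{18.103}$$
$$\bar{y}^{*i}_t = h(\bar{z}^{0i}_t) \tag{18.104}$$
$$\hat{y}_t = \sum_{i=0}^{2d} w_m^i \bar{y}^{*i}_t \tag{18.105}$$
$$S_t = \sum_{i=0}^{2d} w_c^i (\bar{y}^{*i}_t - \hat{y}_t)(\bar{y}^{*i}_t - \hat{y}_t)^T + R_t \tag{18.106}$$


Finally, we use Bayes rule for Gaussians to get the posterior $p(z_t | y_{1:t}, u_{1:t}) \approx \mathcal{N}(z_t | \mu_t, \Sigma_t)$:


$$\bar{\Sigma}^{z,y}_t = \sum_{i=0}^{2d} w_c^i (\bar{z}^{0i}_t - \bar{\mu}_t)(\bar{y}^{*i}_t - \hat{y}_t)^T \tag{18.107}$$
$$K_t = \bar{\Sigma}^{z,y}_t S_t^{-1} \tag{18.108}$$
$$\mu_t = \bar{\mu}_t + K_t (y_t - \hat{y}_t) \tag{18.109}$$
$$\Sigma_t = \bar{\Sigma}_t - K_t S_t K_t^T \tag{18.110}$$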

18.5.3 Assumed density filtering (ADF) 


In this section, we discuss inference where we perform an exact update step, but then
approximate the posterior by a distribution of a certain convenient form, such as a Gaussian. More
precisely, let the unknowns that we want to infer be denoted by $\theta_t$. Suppose that $\mathcal{Q}$ is a set of
tractable distributions, e.g., Gaussians with a diagonal covariance matrix, or a product of discrete
distributions. Suppose that we have an approximate prior $q_{t-1}(\theta_{t-1}) \approx p(\theta_{t-1} | y_{1:t-1})$, where
$q_{t-1} \in \mathcal{Q}$. We can update this with the new measurement to get the approximate posterior


$$\hat{p}(\theta_t) = \frac{1}{Z_t}\, p(y_t | \theta_t)\, q_{t|t-1}(\theta_t) \tag{18.111}$$

where

$$Z_t = \int p(y_t | \theta_t)\, q_{t|t-1}(\theta_t)\, d\theta_t \tag{18.112}$$

is the normalization constant and


$$q_{t|t-1}(\theta_t) = \int p(\theta_t | \theta_{t-1})\, q_{t-1}(\theta_{t-1})\, d\theta_{t-1} \tag{18.113}$$


is the one step ahead predictive distribution. If the prior is from a suitably restricted family, this
one-step update process is usually tractable. However, we often find that the resulting posterior
is no longer in our tractable family, $\hat{p}(\theta_t) \notin \mathcal{Q}$. So after updating we seek the best tractable
approximation by computing


$$q(\theta_t) = \operatorname*{argmin}_{q \in \mathcal{Q}} \text{KL}\left( \hat{p}(\theta_t) \,\|\, q(\theta_t) \right) \tag{18.114}$$


This minimizes the Kullback-Leibler divergence (Section 2.8.2) from the approximation $q(\theta_t)$
to the “exact” posterior $\hat{p}(\theta_t)$, and can be thought of as projecting $\hat{p}$ onto the space of tractable
distributions. The whole algorithm consists of predict-update-project cycles. This is known as
assumed density filtering or ADF (Maybeck 1979). See Figure 18.11(a) for a sketch.


Figure 18.11 (a) Illustration of the predict-update-project cycle of assumed density filtering. (b) A dynamical
logistic regression model. Compare to Figure 18.4(a).

If q is in the exponential family, one can show that this KL minimization can be done by 
moment matching. We give some examples of this below. 
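
As a small illustration of the projection step, suppose the exact posterior is a one-dimensional mixture of Gaussians and the tractable family is the set of single Gaussians. Moment matching then amounts to computing the mixture's mean and variance. The function below is a hypothetical sketch, not part of any library.

```python
def project_mixture_to_gaussian(weights, means, variances):
    """Moment-match a 1-D Gaussian mixture with a single Gaussian.

    For a Gaussian family, minimizing KL(p || q) over q reduces to matching
    the mean and variance of p.
    """
    total = sum(weights)
    ws = [w / total for w in weights]
    mean = sum(w * m for w, m in zip(ws, means))
    second = sum(w * (v + m * m) for w, m, v in zip(ws, means, variances))
    return mean, second - mean * mean

print(project_mixture_to_gaussian([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]))
```

The projected variance (2) exceeds the variance of either component (1) because it also accounts for the spread between the component means.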


Boyen-Koller algorithm for online inference in DBNs 


If we are performing inference in a discrete-state dynamic Bayes net (Section 17.6.7), where $\theta_{tj}$
is the $j$’th hidden variable at time $t$, then the exact posterior $p(\theta_t)$ becomes intractable to
compute because of the entanglement problem. Suppose we use a fully factored approximation
of the form $q(\theta_t) = \prod_{j=1}^{D} \text{Cat}(\theta_{t,j} | \pi_{t,j})$, where $\pi_{tjk} = q(\theta_{t,j} = k)$ is the probability variable
$j$ is in state $k$, and $D$ is the number of variables. In this case, the moment matching operation
becomes


$$\pi_{tjk} = \hat{p}(\theta_{t,j} = k) \tag{18.115}$$

This can be computed by performing a predict-update step using the factored prior, and then
computing the posterior marginals. This is known as the Boyen-Koller algorithm, named after
the authors of (Boyen and Koller 1998), who demonstrated that the error incurred by this series
of repeated approximations remains bounded (under certain assumptions about the stochasticity
of the system).
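
The projection in Equation 18.115 simply marginalizes the updated joint distribution. A toy sketch for a small explicit joint table is given below; the function name and the dictionary representation are our assumptions, and a real DBN would compute these marginals by inference rather than by enumerating the joint.

```python
def project_to_marginals(joint, num_vars, num_states):
    """Project a joint table {state tuple: prob} onto a product of categorical marginals."""
    pi = [[0.0] * num_states for _ in range(num_vars)]
    for states, prob in joint.items():
        for j, k in enumerate(states):
            pi[j][k] += prob        # pi[j][k] = p_hat(theta_j = k)
    return pi

# Two strongly correlated binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(project_to_marginals(joint, num_vars=2, num_states=2))
```

Here both marginals are uniform, so the factored approximation discards the correlation between the two variables; this is the approximation error that is incurred at each step.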


Gaussian approximation for online inference in GLMs 


Now suppose $q(\theta_t) = \prod_{j=1}^{D} \mathcal{N}(\theta_{t,j} | \mu_{t,j}, \tau_{t,j})$, where $\tau_{t,j}$ is the variance. Then the optimal
parameters of the tractable approximation to the posterior are


$$\mu_{t,j} = E_{\hat{p}}[\theta_{t,j}], \quad \tau_{t,j} = \text{var}_{\hat{p}}[\theta_{t,j}] \tag{18.116}$$


This method can be used to do online inference for the parameters of many statistical models.
For example, the TrueSkill system, used in Microsoft's Xbox to rank players over time, uses this
form of approximation (Herbrich et al. 2007). We can also apply this method to simpler models,
such as GLMs, which have the advantage that the posterior is log-concave. Below we explain how
to do this for binary logistic regression, following the presentation of (Zoeter 2007).

The model has the form


$$p(y_t | x_t, \theta_t) = \text{Ber}(y_t | \text{sigm}(x_t^T \theta_t)) \tag{18.117}$$
$$p(\theta_t | \theta_{t-1}) = \mathcal{N}(\theta_t | \theta_{t-1}, \sigma^2 I) \tag{18.118}$$

where $\sigma^2$ is some process noise which allows the parameters to change slowly over time. (This
can be set to 0, as in the recursive least squares method (Section 18.2.3), if desired.) We will
assume $q_{t-1}(\theta_{t-1}) = \prod_j \mathcal{N}(\theta_{t-1,j} | \mu_{t-1,j}, \tau_{t-1,j})$ is the tractable prior. We can compute the
one-step-ahead predictive density $q_{t|t-1}(\theta_t)$ using the standard linear-Gaussian update. So now
we concentrate on the measurement update step.

Define the deterministic quantity $s_t = \theta_t^T x_t$, as shown in Figure 18.11(b). If $q_{t|t-1}(\theta_t) =
\prod_j \mathcal{N}(\theta_{t,j} | \mu_{t|t-1,j}, \tau_{t|t-1,j})$ then we can compute the predictive distribution for $s_t$ as follows:


$$q_{t|t-1}(s_t) = \mathcal{N}(s_t | m_{t|t-1}, v_{t|t-1}) \tag{18.119}$$
$$m_{t|t-1} = \sum_j x_{t,j}\, \mu_{t|t-1,j} \tag{18.120}$$
$$v_{t|t-1} = \sum_j x_{t,j}^2\, \tau_{t|t-1,j} \tag{18.121}$$

The posterior for $s_t$ is given by


$$q_t(s_t) = \mathcal{N}(s_t | m_t, v_t) \tag{18.122}$$
$$m_t = \int s_t\, \frac{1}{Z_t}\, p(y_t | s_t)\, q_{t|t-1}(s_t)\, ds_t \tag{18.123}$$
$$v_t = \int s_t^2\, \frac{1}{Z_t}\, p(y_t | s_t)\, q_{t|t-1}(s_t)\, ds_t - m_t^2 \tag{18.124}$$
$$Z_t = \int p(y_t | s_t)\, q_{t|t-1}(s_t)\, ds_t \tag{18.125}$$

where $p(y_t | s_t) = \text{Ber}(y_t | \text{sigm}(s_t))$. These integrals are one dimensional, and so can be computed
using Gaussian quadrature (see (Zoeter 2007) for details). This is the same as one step of the
UKF algorithm.

Having inferred $q(s_t)$, we need to compute $q(\boldsymbol{\theta}|s_t)$. This can be done as follows. Define $\delta_m$
as the change in the mean of $s_t$ and $\delta_v$ as the change in the variance:


$$m_t = m_{t|t-1} + \delta_m, \quad v_t = v_{t|t-1} + \delta_v \tag{18.126}$$


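Then one can show that the new factored posterior over the model parameters is given by


$$q(\theta_{t,j}) = \mathcal{N}(\theta_{t,j}|\mu_{t,j}, \tau_{t,j}) \tag{18.127}$$
$$\mu_{t,j} = \mu_{t|t-1,j} + a_j \delta_m \tag{18.128}$$
$$\tau_{t,j} = \tau_{t|t-1,j} + a_j^2 \delta_v \tag{18.129}$$
$$a_j \triangleq \frac{x_{t,j}\, \tau_{t|t-1,j}}{\sum_{j'} x_{t,j'}^2\, \tau_{t|t-1,j'}} \tag{18.130}$$


(The denominator of $a_j$ is the predictive variance $v_{t|t-1}$ of Equation 18.121.)
Thus we see that the parameters which correspond to inputs with larger magnitude (big $|x_{t,j}|$)
or larger uncertainty (big $\tau_{t|t-1,j}$) get updated most, which makes intuitive sense.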
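
The measurement update in Equations 18.119 to 18.130 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of (Zoeter 2007): the function names are hypothetical, the one-dimensional integrals are approximated by a plain Riemann sum on a grid (an assumption standing in for Gaussian quadrature), and $a_j$ is computed as $x_{t,j}\tau_{t|t-1,j}/v_{t|t-1}$.

```python
import math

def sigm(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def posterior_moments(y, m_pred, v_pred, n_grid=2001, width=8.0):
    """Mean and variance of q_t(s_t) (Equations 18.122-18.125).

    Uses a plain Riemann sum over m_pred +/- width standard deviations;
    this grid rule is an assumption standing in for Gaussian quadrature.
    """
    sd = math.sqrt(v_pred)
    lo = m_pred - width * sd
    h = 2.0 * width * sd / (n_grid - 1)
    Z = first = second = 0.0
    for i in range(n_grid):
        s = lo + i * h
        prior = math.exp(-0.5 * (s - m_pred) ** 2 / v_pred)  # unnormalized Gaussian
        lik = sigm(s) if y == 1 else 1.0 - sigm(s)           # Ber(y | sigm(s))
        w = prior * lik
        Z += w
        first += w * s
        second += w * s * s
    m = first / Z            # the grid spacing and Gaussian normalizer cancel
    v = second / Z - m * m
    return m, v

def adf_update(mu, tau, x, y, sigma2=0.0):
    """One predict step plus one measurement update for the factored posterior."""
    tau_pred = [t + sigma2 for t in tau]                       # random-walk prediction
    m_pred = sum(xj * mj for xj, mj in zip(x, mu))             # Eq. 18.120
    v_pred = sum(xj * xj * tj for xj, tj in zip(x, tau_pred))  # Eq. 18.121
    if v_pred == 0.0:
        return list(mu), tau_pred  # this input says nothing about theta
    m_new, v_new = posterior_moments(y, m_pred, v_pred)
    delta_m, delta_v = m_new - m_pred, v_new - v_pred          # Eq. 18.126
    a = [xj * tj / v_pred for xj, tj in zip(x, tau_pred)]      # Eq. 18.130
    mu_new = [mj + aj * delta_m for mj, aj in zip(mu, a)]      # Eq. 18.128
    tau_new = [tj + aj * aj * delta_v for tj, aj in zip(tau_pred, a)]  # Eq. 18.129
    return mu_new, tau_new

# Hypothetical usage: two weights, only the first input feature is active.
mu, tau = adf_update([0.0, 0.0], [1.0, 1.0], x=[2.0, 0.0], y=1)
```

In this usage example only the first weight moves, since the second has $x_{t,j} = 0$ and hence $a_j = 0$. Each update touches every parameter once, which is where the O(D) cost per time step comes from.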

In (Opper 1998) a version of this algorithm is derived using a probit likelihood (see Section 9.4). 
In this case, the measurement update can be done in closed form, without the need for numerical 
integration. In either case, the algorithm only takes O(D) operations per time step, so it can 
be applied to models with large numbers of parameters. And since it is an online algorithm, 
it can also handle massive datasets. For example (Zhang et al. 2010) use a version of this 
algorithm to fit a multi-class classifier online to very large datasets. They beat alternative (non 
Bayesian) online learning algorithms, and sometimes even outperform state of the art batch 
(offline) learning methods such as SVMs (described in Section 14.5). 


Hybrid discrete/continuous SSMs 


Many systems contain both discrete and continuous hidden variables; these are known as hybrid 
systems. For example, the discrete variables may indicate whether a measurement sensor is 
faulty or not, or which “regime” the system is in. We will see some other examples below. 

A special case of a hybrid system is when we combine an HMM and an LG-SSM. This is 
called a switching linear dynamical system (SLDS), a jump Markov linear system (JMLS), 
or a switching state space model (SSSM). More precisely, we have a discrete latent variable, 
$q_t \in \{1, \ldots, K\}$, a continuous latent variable, $\mathbf{z}_t \in \mathbb{R}^L$, a continuous observed response
$\mathbf{y}_t \in \mathbb{R}^D$ and an optional continuous observed input or control $\mathbf{u}_t \in \mathbb{R}^U$. We then assume that
the continuous variables have linear Gaussian CPDs, conditional on the discrete states: 


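$$p(q_t = k|q_{t-1} = j, \boldsymbol{\theta}) = A_{jk} \tag{18.131}$$
$$p(\mathbf{z}_t|\mathbf{z}_{t-1}, q_t = k, \mathbf{u}_t, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{z}_t|\mathbf{A}_k \mathbf{z}_{t-1} + \mathbf{B}_k \mathbf{u}_t, \mathbf{Q}_k) \tag{18.132}$$
$$p(\mathbf{y}_t|\mathbf{z}_t, q_t = k, \mathbf{u}_t, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{y}_t|\mathbf{C}_k \mathbf{z}_t + \mathbf{D}_k \mathbf{u}_t, \mathbf{R}_k) \tag{18.133}$$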


See Figure 18.12(a) for the DGM representation. 



Figure 18.12 A switching linear dynamical system. (a) Squares represent discrete nodes, circles represent 
continuous nodes. (b) Illustration of how the number of modes in the belief state grows exponentially over 
time. We assume there are two binary states. 


Inference 


Unfortunately inference (i.e., state estimation) in hybrid models, including the switching LG-SSM
model, is intractable. To see why, suppose $q_t$ is binary, but that only the dynamics
$\mathbf{A}$ depend on $q_t$, not the observation matrix. Our initial belief state will be a mixture of
2 Gaussians, corresponding to $p(\mathbf{z}_1|\mathbf{y}_1, q_1 = 1)$ and $p(\mathbf{z}_1|\mathbf{y}_1, q_1 = 2)$. The one-step-ahead
predictive density will be a mixture of 4 Gaussians, $p(\mathbf{z}_2|\mathbf{y}_1, q_1 = 1, q_2 = 1)$, $p(\mathbf{z}_2|\mathbf{y}_1, q_1 = 1, q_2 = 2)$,
$p(\mathbf{z}_2|\mathbf{y}_1, q_1 = 2, q_2 = 1)$, and $p(\mathbf{z}_2|\mathbf{y}_1, q_1 = 2, q_2 = 2)$, obtained by passing each of
the prior modes through the 2 possible transition models. The belief state at step 2 will also be
a mixture of 4 Gaussians, obtained by updating each of the above distributions with $\mathbf{y}_2$. At step
3, the belief state will be a mixture of 8 Gaussians. And so on. So we see there is an exponential
explosion in the number of modes (see Figure 18.12(b)).
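
The growth in the number of modes can be checked by enumerating the discrete histories directly. The helper name below is hypothetical; it simply lists every switch sequence $(q_1, \ldots, q_T)$, each of which contributes one Gaussian to the exact belief state.

```python
from itertools import product

def discrete_histories(K, T):
    """All switch sequences (q_1, ..., q_T) with q_t in {1, ..., K}."""
    return list(product(range(1, K + 1), repeat=T))

# One Gaussian per history, so the exact belief state has K**T modes.
mode_counts = [len(discrete_histories(2, T)) for T in (1, 2, 3)]
```

For a binary switch this gives 2, 4 and 8 modes at steps 1, 2 and 3, matching the counts in the text.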

Various approximate inference methods have been proposed for this model, such as the 
following: 


• Prune off low probability trajectories in the discrete tree; this is the basis of multiple
hypothesis tracking (Bar-Shalom and Fortmann 1988; Bar-Shalom and Li 1993).


• Use Monte Carlo. Essentially we just sample discrete trajectories, and apply an analytical
filter to the continuous variables conditional on a trajectory. See Section 23.6 for details.


• Use ADF, where we approximate the exponentially large mixture of Gaussians with a smaller
mixture of Gaussians. See Section 18.6.1.1 for details.


A Gaussian sum filter for switching SSMs


Figure 18.13 ADF for a switching linear dynamical system. (a) GPB2 method. (b) IMM method. See text
for details.


A Gaussian sum filter (Sorenson and Alspach 1971) approximates the belief state at each step
by a mixture of K Gaussians. This can be implemented by running K Kalman filters in
parallel. This is particularly well suited to switching SSMs. We now describe one version of this
algorithm, known as the “second order generalized pseudo Bayes filter” (GPB2) (Bar-Shalom
and Fortmann 1988). We assume that the prior belief state $b_{t-1}$ is a mixture of K Gaussians,
one per discrete state:


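$$b_{t-1}^{i} \triangleq p(\mathbf{z}_{t-1}, q_{t-1} = i|\mathbf{y}_{1:t-1}) = \pi_{t-1,i}\, \mathcal{N}(\mathbf{z}_{t-1}|\boldsymbol{\mu}_{t-1,i}, \boldsymbol{\Sigma}_{t-1,i}) \tag{18.134}$$

We then pass this through the K different linear models to get

$$b_{t}^{ij} \triangleq p(\mathbf{z}_t, q_{t-1} = i, q_t = j|\mathbf{y}_{1:t}) = \pi_{t,ij}\, \mathcal{N}(\mathbf{z}_t|\boldsymbol{\mu}_{t,ij}, \boldsymbol{\Sigma}_{t,ij}) \tag{18.135}$$


where $\pi_{t,ij} = \pi_{t-1,i}\, p(q_t = j|q_{t-1} = i)$. Finally, for each value of j, we collapse the K Gaussians
(one per value of i) down to a single Gaussian to give


$$b_{t}^{j} \triangleq p(\mathbf{z}_t, q_t = j|\mathbf{y}_{1:t}) = \pi_{t,j}\, \mathcal{N}(\mathbf{z}_t|\boldsymbol{\mu}_{t,j}, \boldsymbol{\Sigma}_{t,j}) \tag{18.136}$$

See Figure 18.13(a) for a sketch.
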
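The optimal way to approximate a mixture of Gaussians with a single Gaussian is given by
$q = \arg\min_q \mathbb{KL}(p\|q)$, where $p(\mathbf{z}) = \sum_k \pi_k \mathcal{N}(\mathbf{z}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ and $q(\mathbf{z}) = \mathcal{N}(\mathbf{z}|\boldsymbol{\mu}, \boldsymbol{\Sigma})$. This can
be solved by moment matching, that is,


$$\boldsymbol{\mu} = \mathbb{E}[\mathbf{z}] = \sum_k \pi_k \boldsymbol{\mu}_k \tag{18.137}$$
$$\boldsymbol{\Sigma} = \mathrm{cov}[\mathbf{z}] = \sum_k \pi_k \left(\boldsymbol{\Sigma}_k + (\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T\right) \tag{18.138}$$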


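In the graphical model literature, this is called weak marginalization (Lauritzen 1992), since it
preserves the first two moments. Applying these equations to our model, we can go from $b_t^{ij}$ to
$b_t^{j}$ as follows (where we drop the t subscript for brevity):


$$\pi_j = \sum_i \pi_{ij} \tag{18.139}$$
$$\pi_{i|j} = \frac{\pi_{ij}}{\sum_{i'} \pi_{i'j}} \tag{18.140}$$
$$\boldsymbol{\mu}_j = \sum_i \pi_{i|j}\, \boldsymbol{\mu}_{ij} \tag{18.141}$$
$$\boldsymbol{\Sigma}_j = \sum_i \pi_{i|j} \left(\boldsymbol{\Sigma}_{ij} + (\boldsymbol{\mu}_{ij} - \boldsymbol{\mu}_j)(\boldsymbol{\mu}_{ij} - \boldsymbol{\mu}_j)^T\right) \tag{18.142}$$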


This algorithm requires running $K^2$ filters at each step. A cheaper alternative is to represent
the belief state by a single Gaussian, marginalizing over the discrete switch at each step. This
is a straightforward application of ADF. An offline extension to this method, called expectation
correction, is described in (Barber 2006; Mesot and Barber 2009).
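
The collapse step of GPB2 (Equations 18.137 to 18.142) is just weighted moment matching, and can be sketched without any numerical library. The function names and the nested-list layout (`pi[i][j]`, `mu[i][j]`, `Sigma[i][j]`) are assumptions made for illustration; the Kalman filter updates that produce these quantities are not shown.

```python
def collapse(weights, means, covs):
    """Moment-match a Gaussian mixture to one Gaussian (Eqs. 18.137-18.138).

    weights need not sum to 1; their total is returned as the new weight.
    means[k] is a list of length D, covs[k] a D x D nested list.
    """
    total = sum(weights)
    w = [wk / total for wk in weights]
    D = len(means[0])
    mu = [sum(wk * mk[d] for wk, mk in zip(w, means)) for d in range(D)]
    Sigma = [[0.0] * D for _ in range(D)]
    for wk, mk, Sk in zip(w, means, covs):
        diff = [mk[d] - mu[d] for d in range(D)]
        for r in range(D):
            for c in range(D):
                # within-component covariance plus spread of the means
                Sigma[r][c] += wk * (Sk[r][c] + diff[r] * diff[c])
    return total, mu, Sigma

def gpb2_collapse(pi, mu, Sigma):
    """For each j, merge the K components indexed by i (Eqs. 18.139-18.142)."""
    K = len(pi)
    return [collapse([pi[i][j] for i in range(K)],
                     [mu[i][j] for i in range(K)],
                     [Sigma[i][j] for i in range(K)])
            for j in range(K)]

# Two unit-variance components at -1 and +1 merge to mean 0, variance 2.
w, m, S = collapse([0.5, 0.5], [[-1.0], [1.0]], [[[1.0]], [[1.0]]])
```

The merged variance exceeds either component's variance because the spread between the component means is folded into the covariance.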

Another heuristic approach, known as interactive multiple models or IMM (Bar-Shalom and
Fortmann 1988), can be obtained by first collapsing the prior to a single Gaussian (by moment
matching), and then updating it using K different Kalman filters, one per value of $q_t$. See
Figure 18.13(b) for a sketch.


Application: data association and multi-target tracking 


Suppose we are tracking K objects, such as airplanes, and at time t, we observe $K'$ detection
events, e.g., “blips” on a radar screen. We can have $K' < K$ due to occlusion or missed
detections. We can have $K' > K$ due to clutter or false alarms. Or we can have $K' = K$. In
any case, we need to figure out the correspondence between the $K'$ detections $\mathbf{y}_{t,k}$ and the K
objects $\mathbf{z}_{t,j}$. This is called the problem of data association, and it arises in many application
domains.

Figure 18.14 A model for tracking two objects in the presence of data-association ambiguity. We observe
3, 1 and 2 detections in the first three time steps.


Figure 18.14 gives an example in which we are tracking K = 2 objects. At each time step, $q_t$
is the unknown mapping which specifies which objects caused which observations. It specifies
the “wiring diagram” for time slice t. The standard way to solve this problem is to compute
a weight which measures the “compatibility” between object j and measurement k, typically
based on how close k is to where the model thinks j should be (the so-called nearest neighbor
data association heuristic). This gives us a $K \times K'$ weight matrix. We can make this into a
square matrix of size $N \times N$, where $N = \max(K, K')$, by adding dummy background objects,
which can explain all the false alarms, and adding dummy observations, which can explain all
the missed detections. We can then compute the maximal weight bipartite matching using the
Hungarian algorithm, which takes $O(N^3)$ time (see e.g., (Burkard et al. 2009)). Conditional
on this, we can perform a Kalman filter update, where objects that are assigned to dummy
observations do not perform a measurement update.
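
The padding-and-matching step can be illustrated as follows. This sketch is not the Hungarian algorithm: to stay dependency-free it finds the maximal weight matching by brute force over all N! permutations, which is only practical for small N. The function name, the `dummy_weight` parameter and the example weights are hypothetical.

```python
from itertools import permutations

def associate(weights, dummy_weight=0.0):
    """Match K objects to K' detections by maximal total weight.

    weights[j][k] is the compatibility of object j with detection k
    (larger is better); at least one object and one detection are assumed.
    Returns a list whose j-th entry is the detection index assigned to
    object j, or None if object j was matched to a dummy observation
    (a missed detection).
    """
    K, Kp = len(weights), len(weights[0])
    N = max(K, Kp)
    # Pad to N x N with dummy objects (rows) and dummy observations (columns).
    W = [[weights[j][k] if j < K and k < Kp else dummy_weight
          for k in range(N)] for j in range(N)]
    # Brute-force stand-in for the Hungarian algorithm.
    best = max(permutations(range(N)),
               key=lambda p: sum(W[j][p[j]] for j in range(N)))
    return [best[j] if best[j] < Kp else None for j in range(K)]

# Two objects, three detections: detection 1 is left to a dummy object.
assignment = associate([[0.9, 0.1, 0.2],
                        [0.2, 0.1, 0.8]])
```

Objects whose entry is `None` would skip the Kalman measurement update, as described above.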

An extension of this method, to handle a variable and/or unknown number of objects, is 
known as multi-target tracking. This requires dealing with a variable-sized state space. There 
are many ways to do this, but perhaps the simplest and most robust methods are based on
sequential Monte Carlo (e.g., (Ristic et al. 2004)) or MCMC (e.g., (Khan et al. 2006; Oh et al.
2009)).


Application: fault diagnosis 


Consider the model in Figure 18.15(a). This represents an industrial plant consisting of various 
tanks of liquid, interconnected by pipes. In this example, we just have two tanks, for simplicity. 
We want to estimate the pressure inside each tank, based on a noisy measurement of the flow 
into and out of each tank. However, the measurement devices can sometimes fail. Furthermore, 
pipes can burst or get blocked; we call this a “resistance failure”. This model is widely used as 
a benchmark in the fault diagnosis community (Mosterman and Biswas 1999). 

We can create a probabilistic model of the system as shown in Figure 18.15(b). The square 
nodes represent discrete variables, such as measurement failures and resistance failures. The 
remaining variables are continuous. A variety of approximate inference algorithms can be applied 
to this model. See (Koller and Lerner 2001) for one approach, based on Rao-Blackwellized particle 
filtering (which is explained in Section 23.6).


Figure 18.15 (a) The two-tank system. The goal is to infer when pipes are blocked or have burst, or
sensors have broken, from (noisy) observations of the flow out of tank 1, F1o, out of tank 2, F2o, or
between tanks 1 and 2, F12. R1o is a hidden variable representing the resistance of the pipe out of
tank 1, P1 is a hidden variable representing the pressure in tank 1, etc. Source: Figure 11 of (Koller and
Lerner 2001). Used with kind permission of Daphne Koller. (b) Dynamic Bayes net representation of the
two-tank system. Discrete nodes are squares, continuous nodes are circles. Abbreviations: R = resistance,
P = pressure, F = flow, M = measurement, RF = resistance failure, MF = measurement failure. Based on
Figure 12 of (Koller and Lerner 2001).


Application: econometric forecasting 


The switching LG-SSM model is widely used in econometric forecasting, where it is called 
a regime switching model. For example, we can combine two linear trend models (see
Section 18.2.4.2), one in which $b_t > 0$ reflects a growing economy, and one in which $b_t < 0$ reflects
a shrinking economy. See (West and Harrison 1997) for further details.


Exercises 


Exercise 18.1 Derivation of EM for LG-SSM 


Derive the E and M steps for computing a (locally optimal) MLE for an LG-SSM model. Hint: the results 
are in (Ghahramani and Hinton 1996b); your task is to derive these results. 


Exercise 18.2 Seasonal LG-SSM model in standard form 
Write the seasonal model in Figure 18.7(a) as an LG-SSM. Define the matrices A, C, Q and R.


Undirected graphical models (Markov random fields)


Introduction 


In Chapter 10, we discussed directed graphical models (DGMs), commonly known as Bayes nets. 
However, for some domains, being forced to choose a direction for the edges, as required by 
a DGM, is rather awkward. For example, consider modeling an image. We might suppose that 
the intensity values of neighboring pixels are correlated. We can create a DAG model with a 2d 
lattice topology as shown in Figure 19.1(a). This is known as a causal MRF or a Markov mesh 
(Abend et al. 1965). However, its conditional independence properties are rather unnatural. In 
particular, the Markov blanket (defined in Section 10.5) of the node $X_8$ in the middle is the other
colored nodes (3, 4, 7, 9, 12 and 13) rather than just its 4 nearest neighbors as one might expect.

An alternative is to use an undirected graphical model (UGM), also called a Markov random 
field (MRF) or Markov network. These do not require us to specify edge orientations, and are 
much more natural for some problems such as image analysis and spatial statistics. For example, 
an undirected 2d lattice is shown in Figure 19.1(b); now the Markov blanket of each node is just 
its nearest neighbors, as we show in Section 19.2. 

Roughly speaking, the main advantages of UGMs over DGMs are: (1) they are symmetric and
therefore more “natural” for certain domains, such as spatial or relational data; and (2)
discriminative UGMs (aka conditional random fields, or CRFs), which define conditional densities of the
form $p(\mathbf{y}|\mathbf{x})$, work better than discriminative DGMs, for reasons we explain in Section 19.6.1. The
main disadvantages of UGMs compared to DGMs are: (1) the parameters are less interpretable
and less modular, for reasons we explain in Section 19.3; and (2) parameter estimation is
computationally more expensive, for reasons we explain in Section 19.5. See (Domke et al. 2008) for
an empirical comparison of the two approaches for an image processing task.


Conditional independence properties of UGMs 


Key properties 


UGMs define CI relationships via simple graph separation as follows: for sets of nodes A, B,
and C, we say $\mathbf{x}_A \perp_G \mathbf{x}_B | \mathbf{x}_C$ iff C separates A from B in the graph G. This means that,
when we remove all the nodes in C, if there are no paths connecting any node in A to any
node in B, then the CI property holds. This is called the global Markov property for UGMs.
For example, in Figure 19.2(b), we have that $\{1, 2\} \perp \{6, 7\} | \{3, 4, 5\}$.
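
Graph separation is easy to test mechanically: delete the nodes in C and search for a path from A to B. The sketch below uses breadth-first search. The function name is hypothetical, and the edge list for Figure 19.2(b) is an assumption pieced together from the relationships stated in the text, not read off the figure.

```python
from collections import deque

def separated(edges, A, B, C):
    """True if C separates A from B in the undirected graph given by edges."""
    adj = {}
    for s, t in edges:
        adj.setdefault(s, set()).add(t)
        adj.setdefault(t, set()).add(s)
    A, B, C = set(A), set(B), set(C)
    frontier = deque(A - C)
    seen = set(frontier)
    while frontier:
        u = frontier.popleft()
        if u in B:
            return False          # found a path that avoids C
        for v in adj.get(u, ()):
            if v not in C and v not in seen:
                seen.add(v)
                frontier.append(v)
    return True

# Assumed edges of the moralized graph in Figure 19.2(b).
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (2, 5), (3, 5), (3, 6),
         (4, 5), (4, 6), (5, 6), (4, 7), (5, 7), (6, 7)]
global_ok = separated(EDGES, {1, 2}, {6, 7}, {3, 4, 5})
```

With this edge list, conditioning on {3, 4, 5} separates {1, 2} from {6, 7}, whereas conditioning on {2} alone does not separate 1 from 7 (a path through 3 and 6 remains).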

Figure 19.1 (a) A 2d lattice represented as a DAG. The dotted red node $X_8$ is independent of all other
nodes (black) given its Markov blanket, which includes its parents (blue), children (green) and co-parents
(orange). (b) The same model represented as a UGM. The red node $X_8$ is independent of the other black
nodes given its neighbors (blue nodes).


Figure 19.2 (a) A DGM. (b) Its moralized version, represented as a UGM. 


The set of nodes that renders a node t conditionally independent of all the other nodes in
the graph is called t's Markov blanket; we will denote this by mb(t). Formally, the Markov
blanket satisfies the following property:


$$t \perp \mathcal{V} \setminus \mathrm{cl}(t) \mid \mathrm{mb}(t) \tag{19.1}$$


where $\mathrm{cl}(t) \triangleq \mathrm{mb}(t) \cup \{t\}$ is the closure of node t. One can show that, in a UGM, a node's
Markov blanket is its set of immediate neighbors. This is called the undirected local Markov
property. For example, in Figure 19.2(b), we have mb(5) = {2, 3, 4, 6, 7}.

From the local Markov property, we can easily see that two nodes are conditionally independent
given the rest if there is no direct edge between them. This is called the pairwise Markov
property. In symbols, this is written as


$$s \perp t \mid \mathcal{V} \setminus \{s, t\} \iff G_{st} = 0 \tag{19.2}$$


Using the three Markov properties we have discussed, we can derive the following CI properties 
(amongst others) from the UGM in Figure 19.2(b): 


• Pairwise: $1 \perp 7 \mid \text{rest}$
• Local: $1 \perp \text{rest} \mid 2, 3$
• Global: $1, 2 \perp 6, 7 \mid 3, 4, 5$


Figure 19.3 Relationship between Markov properties of UGMs.


Figure 19.4 (a) The ancestral graph induced by the DAG in Figure 19.2(a) wrt U = {2,4,5}. (b) The
moralized version of (a).


It is obvious that global Markov implies local Markov which implies pairwise Markov. What is 
less obvious, but nevertheless true (assuming p(x) > 0 for all x, i.e., that p is a positive density), 
is that pairwise implies global, and hence that all these Markov properties are the same, as 
illustrated in Figure 19.3 (see e.g., (Koller and Friedman 2009, p119) for a proof).¹ The importance
of this result is that it is usually easier to empirically assess pairwise conditional independence; 
such pairwise CI statements can be used to construct a graph from which global CI statements 
can be extracted. 


An undirected alternative to d-separation 


We have seen that determining CI relationships in UGMs is much easier than in DGMs, because
we do not have to worry about the directionality of the edges. In this section, we show how to
determine CI relationships for a DGM using a UGM.

It is tempting to simply convert the DGM to a UGM by dropping the orientation of the edges,
but this is clearly incorrect, since a v-structure $A \rightarrow B \leftarrow C$ has quite different CI properties
than the corresponding undirected chain $A - B - C$. The latter graph incorrectly states that
$A \perp C | B$. To avoid such incorrect CI statements, we can add edges between the “unmarried”
parents A and C, and then drop the arrows from the edges, forming (in this case) a fully
connected undirected graph. This process is called moralization. Figure 19.2(b) gives a larger
example of moralization: we interconnect 2 and 3, since they have a common child 5, and we
interconnect 4, 5 and 6, since they have a common child 7.


1. The restriction to positive densities arises because deterministic constraints can result in independencies present in
the distribution that are not explicitly represented in the graph. See e.g., (Koller and Friedman 2009, p120) for some
examples. Distributions with non-graphical CI properties are said to be unfaithful to the graph, so $I(p) \neq I(G)$.


[Figure 19.5 region labels: Probabilistic Models; Graphical Models; Directed; Undirected]

Figure 19.5 DGMs and UGMs can perfectly represent different sets of distributions. Some distributions
can be perfectly represented by either DGMs or UGMs; the corresponding graph must be chordal.

Unfortunately, moralization loses some CI information, and therefore we cannot use the
moralized UGM to determine CI properties of the DGM. For example, in Figure 19.2(a), using
d-separation, we see that $4 \perp 5 | 2$. Adding a moralization arc 4 - 5 would lose this fact (see
Figure 19.2(b)). However, notice that the 4-5 moralization edge, due to the common child 7,
is not needed if we do not observe 7 or any of its descendants. This suggests the following
approach to determining if $A \perp B | C$. First we form the ancestral graph of DAG G with respect
to $U = A \cup B \cup C$. This means we remove all nodes from G that are neither in U nor
ancestors of a node in U. We then moralize this ancestral graph, and apply the simple graph separation
rules for UGMs. For example, in Figure 19.4(a), we show the ancestral graph for Figure 19.2(a)
using U = {2,4,5}. In Figure 19.4(b), we show the moralized version of this graph. It is clear
that we now correctly conclude that $4 \perp 5 | 2$.
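
The three steps (ancestral graph, moralization, graph separation) translate directly into code. The sketch below is one possible rendering, with a hypothetical function name; the parent sets given for Figure 19.2(a) are an assumption reconstructed from the statements in the text.

```python
def d_separated(parents, A, B, C):
    """Test A ⊥ B | C in a DAG given as {node: list of parents}."""
    A, B, C = set(A), set(B), set(C)
    # 1. Ancestral graph: keep U = A ∪ B ∪ C and all ancestors of U.
    keep, stack = set(), list(A | B | C)
    while stack:
        n = stack.pop()
        if n not in keep:
            keep.add(n)
            stack.extend(parents.get(n, ()))
    # 2. Moralize: connect each child to its parents, marry co-parents,
    #    and forget edge directions.
    adj = {n: set() for n in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:
            adj[p].add(child)
            adj[child].add(p)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    # 3. Undirected separation: search from A while never entering C.
    stack = list(A - C)
    seen = set(stack)
    while stack:
        u = stack.pop()
        if u in B:
            return False
        for v in adj[u]:
            if v not in C and v not in seen:
                seen.add(v)
                stack.append(v)
    return True

# Assumed parent sets for the DAG in Figure 19.2(a).
PARENTS = {2: [1], 3: [1], 4: [2], 5: [2, 3], 6: [3], 7: [4, 5, 6]}
ok = d_separated(PARENTS, {4}, {5}, {2})
```

Under these assumed parent sets the call confirms $4 \perp 5 | 2$, while additionally conditioning on node 7 brings the 4-5 moralization edge back and the independence no longer holds.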


Comparing directed and undirected graphical models 


Which model has more “expressive power”, a DGM or a UGM? To formalize this question, recall
that we say that G is an I-map of a distribution p if $I(G) \subseteq I(p)$. Now define G to be a
perfect map of p if $I(G) = I(p)$, in other words, the graph can represent all (and only) the CI
properties of the distribution. It turns out that DGMs and UGMs are perfect maps for different
sets of distributions (see Figure 19.5). In this sense, neither is more powerful than the other as
a representation language.

As an example of some CI relationships that can be perfectly modeled by a DGM but not a
UGM, consider a v-structure $A \rightarrow C \leftarrow B$. This asserts that $A \perp B$, and $A \not\perp B | C$. If we drop
the arrows, we get $A - C - B$, which asserts $A \perp B | C$ and $A \not\perp B$, which is incorrect. In fact,
there is no UGM that can precisely represent all and only the two CI statements encoded by a
v-structure. In general, CI properties in UGMs are monotonic, in the following sense: if $A \perp B | C$,
then $A \perp B | (C \cup D)$. But in DGMs, CI properties can be non-monotonic, since conditioning
on extra variables can eliminate conditional independencies due to explaining away.


Figure 19.6 A UGM and two failed attempts to represent it as a DGM. Source: Figure 3.10 of (Koller and
Friedman 2009). Used with kind permission of Daphne Koller.

As an example of some CI relationships that can be perfectly modeled by a UGM but not a
DGM, consider the 4-cycle shown in Figure 19.6(a). One attempt to model this with a DGM is
shown in Figure 19.6(b). This correctly asserts that $A \perp C | B, D$. However, it incorrectly asserts
that $B \perp D | A$. Figure 19.6(c) is another incorrect DGM: it correctly encodes $A \perp C | B, D$, but
incorrectly encodes $B \perp D$. In fact there is no DGM that can precisely represent all and only
the CI statements encoded by this UGM.

Some distributions can be perfectly modeled by either a DGM or a UGM; the resulting graphs 
are called decomposable or chordal. Roughly speaking, this means the following: if we collapse 
together all the variables in each maximal clique, to make “mega-variables”, the resulting graph 
will be a tree. Of course, if the graph is already a tree (which includes chains as a special case), 
it will be chordal. See Section 20.4.1 for further details. 


Parameterization of MRFs 


Although the CI properties of UGMs are simpler and more natural than for DGMs, representing
the joint distribution for a UGM is less natural than for a DGM, as we see below. 


The Hammersley-Clifford theorem 


Since there is no topological ordering associated with an undirected graph, we can't use the chain 
rule to represent p(y). So instead of associating CPDs with each node, we associate potential 
functions or factors with each maximal clique in the graph. We will denote the potential 
function for clique c by $\psi_c(\mathbf{y}_c|\boldsymbol{\theta}_c)$. A potential function can be any non-negative function of
its arguments. The joint distribution is then defined to be proportional to the product of clique 
potentials. Rather surprisingly, one can show that any positive distribution whose CI properties 
can be represented by a UGM can be represented in this way. We state this result more formally 
below.


Theorem 19.3.1 (Hammersley-Clifford). A positive distribution $p(\mathbf{y}) > 0$ satisfies the CI
properties of an undirected graph G iff p can be represented as a product of factors, one per maximal
clique, i.e.,


$$p(\mathbf{y}|\boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \prod_{c \in \mathcal{C}} \psi_c(\mathbf{y}_c|\boldsymbol{\theta}_c) \tag{19.3}$$


where $\mathcal{C}$ is the set of all the (maximal) cliques of G, and $Z(\boldsymbol{\theta})$ is the partition function given by

$$Z(\boldsymbol{\theta}) \triangleq \sum_{\mathbf{y}} \prod_{c \in \mathcal{C}} \psi_c(\mathbf{y}_c|\boldsymbol{\theta}_c) \tag{19.4}$$


Note that the partition function is what ensures the overall distribution sums to 1.


The proof was never published, but can be found in e.g., (Koller and Friedman 2009).

For example, consider the MRF in Figure 10.1(b). If p satisfies the CI properties of this graph
then we can write p as follows:


$$p(\mathbf{y}|\boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \psi_{123}(y_1, y_2, y_3)\, \psi_{234}(y_2, y_3, y_4)\, \psi_{35}(y_3, y_5) \tag{19.5}$$

where

$$Z = \sum_{\mathbf{y}} \psi_{123}(y_1, y_2, y_3)\, \psi_{234}(y_2, y_3, y_4)\, \psi_{35}(y_3, y_5) \tag{19.6}$$
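
For a model this small, the partition function in Equation 19.6 can be computed by brute-force enumeration. The sketch below assumes binary variables and uses made-up potential tables (they are illustrative, not taken from the text); variables $y_1, \ldots, y_5$ are stored at indices 0 to 4.

```python
from itertools import product

def unnormalized(y, cliques, potentials):
    """Product of clique potentials evaluated at the joint state y."""
    value = 1.0
    for c, psi in zip(cliques, potentials):
        value *= psi(*(y[i] for i in c))
    return value

def partition_function(cliques, potentials, n_vars, n_states=2):
    """Z as a sum over all joint states (Equation 19.4); cost n_states**n_vars."""
    return sum(unnormalized(y, cliques, potentials)
               for y in product(range(n_states), repeat=n_vars))

# Cliques {1,2,3}, {2,3,4}, {3,5} of Equation 19.5, zero-indexed.
CLIQUES = [(0, 1, 2), (1, 2, 3), (2, 4)]
# Hypothetical potentials that favor agreement within each clique.
POTENTIALS = [lambda a, b, c: 2.0 if a == b == c else 1.0,
              lambda b, c, d: 2.0 if b == c == d else 1.0,
              lambda c, e: 3.0 if c == e else 1.0]

Z = partition_function(CLIQUES, POTENTIALS, n_vars=5)
prob_all_zero = unnormalized((0, 0, 0, 0, 0), CLIQUES, POTENTIALS) / Z
```

Dividing any unnormalized value by `Z` gives a proper probability, as in Equation 19.5. The enumeration is exponential in the number of variables, which is why it only serves as a check on tiny models.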


There is a deep connection between UGMs and statistical physics. In particular, there is a 
model known as the Gibbs distribution, which can be written as follows: 


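$$p(\mathbf{y}|\boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \exp\left(-\sum_c E(\mathbf{y}_c|\boldsymbol{\theta}_c)\right) \tag{19.7}$$


where $E(\mathbf{y}_c) > 0$ is the energy associated with the variables in clique c. We can convert this to
a UGM by defining


$$\psi_c(\mathbf{y}_c|\boldsymbol{\theta}_c) = \exp(-E(\mathbf{y}_c|\boldsymbol{\theta}_c)) \tag{19.8}$$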


We see that high probability states correspond to low energy configurations. Models of this form 
are known as energy based models, and are commonly used in physics and biochemistry, as 
well as some branches of machine learning (LeCun et al. 2006). 

Note that we are free to restrict the parameterization to the edges of the graph, rather than 
the maximal cliques. This is called a pairwise MRF. In Figure 10.1(b), we get 


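$$p(\mathbf{y}|\boldsymbol{\theta}) \propto \psi_{12}(y_1, y_2)\, \psi_{13}(y_1, y_3)\, \psi_{23}(y_2, y_3)\, \psi_{24}(y_2, y_4)\, \psi_{34}(y_3, y_4)\, \psi_{35}(y_3, y_5) \tag{19.9}$$
$$\propto \prod_{s \sim t} \psi_{st}(y_s, y_t) \tag{19.10}$$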


This form is widely used due to its simplicity, although it is not as general. 


2. The partition function is denoted by Z because of the German word Zustandssumme, which means “sum over states”.
This reflects the fact that a lot of pioneering work in statistical physics was done by Germans.




Representing potential functions 


If the variables are discrete, we can represent the potential or energy functions as tables of 
(non-negative) numbers, just as we did with CPTs. However, the potentials are not probabilities. 
Rather, they represent the relative “compatibility” between the different assignments to the 
potential. We will see some examples of this below. 

A more general approach is to define the log potentials as a linear function of the parameters: 


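$$\log \psi_c(\mathbf{y}_c) \triangleq \boldsymbol{\phi}_c(\mathbf{y}_c)^T \boldsymbol{\theta}_c \tag{19.11}$$


where $\boldsymbol{\phi}_c(\mathbf{y}_c)$ is a feature vector derived from the values of the variables $\mathbf{y}_c$. The resulting log
probability has the form


$$\log p(\mathbf{y}|\boldsymbol{\theta}) = \sum_c \boldsymbol{\phi}_c(\mathbf{y}_c)^T \boldsymbol{\theta}_c - \log Z(\boldsymbol{\theta}) \tag{19.12}$$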


This is also known as a maximum entropy or a log-linear model. 
For example, consider a pairwise MRF, where for each edge, we associate a feature vector of 
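length $K^2$ as follows:


$$\boldsymbol{\phi}_{st}(y_s, y_t) = [\ldots, \mathbb{I}(y_s = j, y_t = k), \ldots] \tag{19.13}$$


If we have a weight for each feature, we can convert this into a $K \times K$ potential function as
follows:


$$\psi_{st}(y_s = j, y_t = k) = \exp([\boldsymbol{\theta}_{st}^T \boldsymbol{\phi}_{st}]_{jk}) = \exp(\theta_{st}(j, k)) \tag{19.14}$$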


So we see that we can easily represent tabular potentials using a log-linear form. But the 
log-linear form is more general. 

To see why this is useful, suppose we are interested in making a probabilistic model of English 
spelling. Since certain letter combinations occur together quite frequently (e.g., “ing”), we will 
need higher order factors to capture this. Suppose we limit ourselves to letter trigrams. A 
tabular potential still has $26^3 = 17{,}576$ parameters in it. However, most of these triples will
never occur. 

An alternative approach is to define indicator functions that look for certain “special” triples,
such as “ing”, “qu-”, etc. Then we can define the potential on each trigram as follows:


$$\psi_t(y_{t-1}, y_t, y_{t+1}) = \exp\left(\sum_k \theta_k \phi_k(y_{t-1}, y_t, y_{t+1})\right) \tag{19.15}$$


where k indexes the different features, corresponding to “ing”, “qu-”, etc., and $\phi_k$ is the
corresponding binary feature function. By tying the parameters across locations, we can define the
probability of a word of any length using


$$p(\mathbf{y}|\boldsymbol{\theta}) \propto \exp\left(\sum_t \sum_k \theta_k \phi_k(y_{t-1}, y_t, y_{t+1})\right) \tag{19.16}$$
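
The tied-parameter score in Equation 19.16 can be sketched as follows. The weights are made up, and treating “-” in a pattern as a wildcard that matches any letter is an assumption about what “qu-” is meant to denote; all names are hypothetical.

```python
def make_feature(pattern):
    """Binary feature for a 3-character pattern; '-' matches any letter."""
    def phi(a, b, c):
        return int(all(p == '-' or p == ch
                       for p, ch in zip(pattern, (a, b, c))))
    return phi

THETA = {"ing": 1.5, "qu-": 0.8}                 # hypothetical weights
FEATURES = {name: make_feature(name) for name in THETA}

def log_score(word):
    """Unnormalized log probability: the exponent of Equation 19.16."""
    return sum(THETA[k] * FEATURES[k](*word[t - 1:t + 2])
               for t in range(1, len(word) - 1)   # centers of all trigrams
               for k in THETA)

score_going = log_score("going")   # only the trigram "ing" fires
score_quiz = log_score("quiz")     # only the trigram "qui" fires via "qu-"
```

Because the same two weights are reused at every position, the model has two parameters regardless of word length, compared with $26^3$ for a full trigram table.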


This raises the question of where these feature functions come from. In many applications, 
they are created by hand to reflect domain knowledge (we will see examples later), but it is also 
possible to learn them from data, as we discuss in Section 19.5.6.


Examples of MRFs


In this section, we show how several popular probability models can be conveniently expressed 
as UGMs. 


Ising model 


The Ising model is an example of an MRF that arose from statistical physics.³ It was originally
used for modeling the behavior of magnets. In particular, let $y_s \in \{-1, +1\}$ represent the spin
of an atom, which can either be spin down or up. In some magnets, called ferromagnets,
neighboring spins tend to line up in the same direction, whereas in other kinds of magnets,
called anti-ferromagnets, the spins “want” to be different from their neighbors.

We can model this as an MRF as follows. We create a graph in the form of a 2D or 3D lattice, 
and connect neighboring variables, as in Figure 19.1(b). We then define the following pairwise 
clique potential: 


$$\psi_{st}(y_s, y_t) = \begin{pmatrix} e^{w_{st}} & e^{-w_{st}} \\ e^{-w_{st}} & e^{w_{st}} \end{pmatrix} \tag{19.17}$$


Here $w_{st}$ is the coupling strength between nodes s and t. If two nodes are not connected in
the graph, we set $w_{st} = 0$. We assume that the weight matrix $\mathbf{W}$ is symmetric, so $w_{st} = w_{ts}$.
Often we assume all edges have the same strength, so $w_{st} = J$ (assuming $w_{st} \neq 0$).

If all the weights are positive, J > 0, then neighboring spins are likely to be in the same 
state; this can be used to model ferromagnets, and is an example of an associative Markov 
network. If the weights are sufficiently strong, the corresponding probability distribution will
have two modes, corresponding to the all +1's state and the all -1's state. These are called the
ground states of the system.

If all of the weights are negative, J < 0, then the spins want to be different from their 
neighbors; this can be used to model an anti-ferromagnet, and results in a frustrated system, 
in which not all the constraints can be satisfied at the same time. The corresponding probability 
distribution will have multiple modes. Interestingly, computing the partition function Z(J) can 
be done in polynomial time for associative Markov networks, but is NP-hard in general (Cipra 
2000). 

There is an interesting analogy between Ising models and Gaussian graphical models. First,
assuming $y_t \in \{-1, +1\}$, we can write the unnormalized log probability of an Ising model as
follows:


$$\log \tilde{p}(\mathbf{y}) = \sum_{s \sim t} y_s w_{st} y_t = \frac{1}{2} \mathbf{y}^T \mathbf{W} \mathbf{y} \tag{19.18}$$


(The factor of $\frac{1}{2}$ arises because we sum each edge twice.) If $w_{st} = J > 0$, we get a low energy
(and hence high probability) if neighboring states agree.

3. Ernst Ising was a German-American physicist, 1900-1998.


Sometimes there is an external field, which is an energy term which is added to each spin.
This can be modelled using a local energy term of the form $-\mathbf{b}^T \mathbf{y}$, where $\mathbf{b}$ is sometimes called
a bias term. The modified distribution is given by


$$\log \tilde{p}(\mathbf{y}) = \sum_{s \sim t} w_{st} y_s y_t + \sum_s b_s y_s = \frac{1}{2} \mathbf{y}^T \mathbf{W} \mathbf{y} + \mathbf{b}^T \mathbf{y} \tag{19.19}$$

where $\boldsymbol{\theta} = (\mathbf{W}, \mathbf{b})$.

If we define $\boldsymbol{\mu} \triangleq \boldsymbol{\Sigma} \mathbf{b}$, $\boldsymbol{\Sigma}^{-1} = -\mathbf{W}$, and $c \triangleq \frac{1}{2} \boldsymbol{\mu}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}$, we can rewrite this in a form
that looks similar to a Gaussian:

$$\tilde{p}(\mathbf{y}) \propto \exp\left(-\frac{1}{2} (\mathbf{y} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \boldsymbol{\mu}) + c\right) \tag{19.20}$$


One very important difference is that, in the case of Gaussians, the normalization constant,
$Z = |2\pi\boldsymbol{\Sigma}|^{1/2}$, requires the computation of a matrix determinant, which can be computed in
$O(D^3)$ time, whereas in the case of the Ising model, the normalization constant requires
summing over all $2^D$ bit vectors; this is equivalent to computing the matrix permanent, which
is NP-hard in general (Jerrum et al. 2004).
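
To make the cost of normalization concrete, here is a small sketch that evaluates the unnormalized log probability $\frac{1}{2}\mathbf{y}^T\mathbf{W}\mathbf{y} + \mathbf{b}^T\mathbf{y}$ and sums over all $2^D$ spin configurations. The function names and the three-node chain are illustrative assumptions; the enumeration is feasible only for very small D.

```python
from itertools import product
import math

def ising_log_ptilde(y, W, b=None):
    """Unnormalized log probability 0.5 * y^T W y + b^T y for spins in {-1, +1}."""
    D = len(y)
    quad = 0.5 * sum(y[s] * W[s][t] * y[t] for s in range(D) for t in range(D))
    lin = sum(b[s] * y[s] for s in range(D)) if b is not None else 0.0
    return quad + lin

def ising_Z(W, b=None):
    """Partition function by enumerating all 2**D spin configurations."""
    D = len(W)
    return sum(math.exp(ising_log_ptilde(y, W, b))
               for y in product((-1, 1), repeat=D))

# A 3-node chain with coupling J = 1 (symmetric W, zero diagonal).
W_CHAIN = [[0, 1, 0],
           [1, 0, 1],
           [0, 1, 0]]
aligned = ising_log_ptilde((1, 1, 1), W_CHAIN)     # neighbors agree
frustrated = ising_log_ptilde((1, -1, 1), W_CHAIN) # neighbors disagree
Z_chain = ising_Z(W_CHAIN)
```

For a chain without external field the result can be checked against the closed form $Z = 2(2\cosh J)^{D-1}$; no such shortcut exists for general graphs, which is the point made above.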


Hopfield networks 


A Hopfield network (Hopfield 1982) is a fully connected Ising model with a symmetric weight
matrix, $\mathbf{W} = \mathbf{W}^T$. These weights, plus the bias terms $\mathbf{b}$, can be learned from training data
using (approximate) maximum likelihood, as described in Section 19.5.⁴

The main application of Hopfield networks is as an associative memory or content
addressable memory. The idea is this: suppose we train on a set of fully observed bit vectors,
corresponding to patterns we want to memorize. Then, at test time, we present a partial pattern
to the network. We would like to estimate the missing variables; this is called pattern
completion. See Figure 19.7 for an example. This can be thought of as retrieving an example from
memory based on a piece of the example itself, hence the term “associative memory”.

Since exact inference is intractable in this model, it is standard to use a coordinate descent 
algorithm known as iterative conditional modes (ICM), which just sets each node to its most 
likely (lowest energy) state, given all its neighbors. The full conditional can be shown to be 


$$p(y_s = 1|\mathbf{y}_{-s}, \boldsymbol{\theta}) = \mathrm{sigm}(\mathbf{w}_{s,:}^T \mathbf{y}_{-s} + b_s) \tag{19.21}$$


Picking the most probable state amounts to using the rule $y_s^* = 1$ if $\sum_t w_{st} y_t + b_s > 0$ and using
$y_s^* = 0$ otherwise. (Much better inference algorithms will be discussed later in this book.)
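
A minimal sketch of ICM for pattern completion follows, assuming units take values in {0, 1} and that each unobserved unit is set to 1 exactly when the argument of the sigmoid in Equation 19.21 is positive (so that the conditional probability exceeds 1/2). The function name, weights and biases are hand-picked for illustration and are not learned from data.

```python
def icm(y, W, b, observed, max_sweeps=20):
    """Iterative conditional modes: greedy coordinate-wise updates.

    y        initial 0/1 state (observed entries hold the known values)
    observed set of indices that are clamped and never updated
    """
    y = list(y)
    D = len(y)
    for _ in range(max_sweeps):
        changed = False
        for s in range(D):
            if s in observed:
                continue
            activation = sum(W[s][t] * y[t] for t in range(D) if t != s) + b[s]
            new = 1 if activation > 0 else 0   # mode of the full conditional
            if new != y[s]:
                y[s] = new
                changed = True
        if not changed:      # a sweep with no changes is a fixed point
            break
    return y

# Hypothetical weights storing the pattern [1, 1, 0, 0]: units 0 and 1
# excite each other, units 2 and 3 excite each other, the groups inhibit.
W_MEM = [[0, 2, -2, -2],
         [2, 0, -2, -2],
         [-2, -2, 0, 2],
         [-2, -2, 2, 0]]
B_MEM = [-1, -1, -1, -1]
completed = icm([1, 0, 0, 0], W_MEM, B_MEM, observed={0, 2})
```

Here units 0 and 2 are clamped to their observed values and the remaining two are filled in. ICM only finds a local mode, so the result can depend on the initial state and the update order.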

Since inference is deterministic, it is also possible to interpret this model as a recurrent 
neural network. (This is quite different from the feedforward neural nets studied in Section 16.5; 
they are univariate conditional density models of the form $p(y|\mathbf{x}, \boldsymbol{\theta})$ which can only be used for
supervised learning.) See Hertz et al. (1991) for further details on Hopfield networks. 

A Boltzmann machine generalizes the Hopfield / Ising model by including some hidden 
nodes, which makes the model representationally more powerful. Inference in such models 
often uses Gibbs sampling, which is a stochastic version of ICM (see Section 24.2 for details). 


4. ML estimation works much better than the outer product rule proposed in (Hopfield 1982), because it not only
lowers the energy of the observed patterns, but it also raises the energy of the non-observed patterns, in order to make
the distribution sum to one (Hillar et al. 2012).


[Figure 19.7 panel titles, top to bottom: training image; test image (60% occlusion); intermediate result after 5 iterations; recovered image.]

Figure 19.7 Examples of how an associative memory can reconstruct images. These are binary images 
of size 50 x 50 pixels. Top: training images. Row 2: partially visible test images. Row 3: estimate after 
5 iterations. Bottom: final state estimate. Based on Figure 2.1 of Hertz et al. (1991). Figure generated by 
hopfieldDemo. 




Figure 19.8 Visualizing a sample from a 10-state Potts model of size 128 x 128 for different association 
strengths: (a) J = 1.42, (b) J = 1.44, (c) J = 1.46. The regions are labeled according to size: blue is 
largest, red is smallest. Used with kind permission of Erik Sudderth. See gibbsDemoIsing for Matlab 
code to produce a similar plot for the Ising model. 


However, we could equally well apply Gibbs to a Hopfield net and ICM to a Boltzmann machine: 
the inference algorithm is not part of the model definition. See Section 27.7 for further details 
on Boltzmann machines.


Figure 19.9 A grid-structured MRF with local evidence nodes. 


Potts model 


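It is easy to generalize the Ising model to multiple discrete states, $y_t \in \{1, 2, \ldots, K\}$. It is
common to use a potential function of the following form (shown here for $K = 3$):

$$\psi_{st}(y_s, y_t) = \begin{pmatrix} e^J & e^0 & e^0 \\ e^0 & e^J & e^0 \\ e^0 & e^0 & e^J \end{pmatrix} \tag{19.22}$$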


This is called the Potts model.⁵ If $J > 0$, then neighboring nodes are encouraged to have the
same label. Some samples from this model are shown in Figure 19.8. We see that for $J > 1.44$,
large clusters occur, for $J < 1.44$, many small clusters occur, and at the critical value of
$J = 1.44$, there is a mix of small and large clusters. This rapid change in behavior as we vary
a parameter of the system is called a phase transition, and has been widely studied in the
physics community. An analogous phenomenon occurs in the Ising model; see (MacKay 2003,
ch 31) for details.
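
To get a feel for how the coupling strength $J$ affects samples, here is a minimal Python sketch of a single-site Gibbs sampler (Section 24.2) for a small Potts lattice. It assumes the pairwise potential is $e^J$ when two neighbors share a label and $e^0 = 1$ otherwise, uses labels $0, \ldots, K-1$, and runs on an 8 × 8 grid with 3 states, so it does not reproduce Figure 19.8. The names `gibbs_potts` and `agreement` are made up for this example.

```python
import math
import random


def gibbs_potts(rows, cols, K, J, sweeps, rng):
    """Run single-site Gibbs sweeps on a rows x cols Potts lattice."""
    y = [[rng.randrange(K) for _ in range(cols)] for _ in range(rows)]
    for _ in range(sweeps):
        for r in range(rows):
            for c in range(cols):
                # Count how many of the (up to 4) neighbours carry each label.
                counts = [0] * K
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        counts[y[rr][cc]] += 1
                # Full conditional: p(y_rc = k | nbrs) is proportional to exp(J * counts[k]).
                weights = [math.exp(J * n) for n in counts]
                y[r][c] = rng.choices(range(K), weights=weights)[0]
    return y


def agreement(y):
    """Fraction of neighbouring pairs that share a label."""
    rows, cols = len(y), len(y[0])
    same = total = 0
    for r in range(rows):
        for c in range(cols):
            for rr, cc in ((r + 1, c), (r, c + 1)):
                if rr < rows and cc < cols:
                    total += 1
                    same += int(y[r][c] == y[rr][cc])
    return same / total


weak = agreement(gibbs_potts(8, 8, 3, 0.0, 30, random.Random(0)))
strong = agreement(gibbs_potts(8, 8, 3, 3.0, 30, random.Random(1)))
print("agreement at J=0:", weak, " at J=3:", strong)
```

At $J = 0$ the labels are independent and uniform, so about one third of neighboring pairs agree; with a large $J$ the sampler quickly forms large same-label regions.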

The Potts model can be used as a prior for image segmentation, since it says that neighboring 
pixels are likely to have the same discrete label and hence belong to the same segment. We can 
combine this prior with a likelihood term as follows: 


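$$p(\mathbf{y}, \mathbf{x} \mid \boldsymbol{\theta}) = p(\mathbf{y} \mid J) \prod_t p(x_t \mid y_t, \boldsymbol{\theta}) = \left[ \frac{1}{Z(J)} \prod_{s \sim t} \psi(y_s, y_t; J) \right] \prod_t p(x_t \mid y_t, \boldsymbol{\theta}) \tag{19.23}$$

where $p(x_t \mid y_t = k, \boldsymbol{\theta})$ is the probability of observing pixel $x_t$ given that the corresponding
segment belongs to class $k$. This observation model can be modeled using a Gaussian or a
non-parametric density. (Note that we label the hidden nodes $y_t$ and the observed nodes $x_t$, to
be compatible with Section 19.6.)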

The corresponding graphical model is a mix of undirected and directed edges, as shown in
Figure 19.9. The undirected 2d lattice represents the prior $p(\mathbf{y})$; in addition, there are directed
edges from each $y_t$ to its corresponding $x_t$, representing the local evidence. Technically speaking,
this combination of an undirected and directed graph is called a chain graph. However,
since the $x_t$ nodes are observed, they can be “absorbed” into the model, thus leaving behind an
undirected “backbone”.


5. Renfrey Potts was an Australian mathematician, 1925-2005.

This model is a 2d analog of an HMM, and could be called a partially observed MRF. As 
in an HMM, the goal is to perform posterior inference, i.e., to compute (some function of) 
$p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta})$. Unfortunately, the 2d case is provably much harder than the 1d case, and we must
resort to approximate methods, as we discuss in later chapters. 

Although the Potts prior is adequate for regularizing supervised learning problems, it is not 
sufficiently accurate to perform image segmentation in an unsupervised way, since the segments 
produced by this model do not accurately represent the kinds of segments one sees in natural 
images (Morris et al. 1996). For the unsupervised case, one needs to use more sophisticated 
priors, such as the truncated Gaussian process prior of (Sudderth and Jordan 2008). 


Gaussian MRFs 


An undirected GGM, also called a Gaussian MRF (see e.g., (Rue and Held 2005)), is a pairwise 
MRF of the following form: 


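$$p(\mathbf{y} \mid \boldsymbol{\theta}) \propto \prod_{s \sim t} \psi_{st}(y_s, y_t) \prod_t \psi_t(y_t) \tag{19.24}$$

$$\psi_{st}(y_s, y_t) = \exp\left(-\frac{1}{2} y_s \Lambda_{st} y_t\right) \tag{19.25}$$

$$\psi_t(y_t) = \exp\left(-\frac{1}{2} \Lambda_{tt} y_t^2 + \eta_t y_t\right) \tag{19.26}$$

(Note that we could easily absorb the node potentials $\psi_t$ into the edge potentials, but we have
kept them separate for clarity.) The joint distribution can be written as follows:

$$p(\mathbf{y} \mid \boldsymbol{\theta}) \propto \exp\left[\boldsymbol{\eta}^T \mathbf{y} - \frac{1}{2} \mathbf{y}^T \boldsymbol{\Lambda} \mathbf{y}\right] \tag{19.27}$$

We recognize this as a multivariate Gaussian written in information form where $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$ and
$\boldsymbol{\eta} = \boldsymbol{\Lambda}\boldsymbol{\mu}$.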

If $\Lambda_{st} = 0$, then there is no pairwise term connecting $s$ and $t$, so by the factorization theorem
(Theorem 2.2.1), we conclude that

$$y_s \perp y_t \mid \mathbf{y}_{-(st)} \iff \Lambda_{st} = 0 \tag{19.28}$$

The zero entries in $\boldsymbol{\Lambda}$ are called structural zeros, since they represent the absent edges in the
graph. Thus undirected GGMs correspond to sparse precision matrices, a fact which we exploit 
in Section 26.7.2 to efficiently learn the structure of the graph. 
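
The distinction between a zero in the precision matrix and a zero in the covariance matrix can be checked numerically. The sketch below uses a 3-node chain $1 - 2 - 3$ with a hypothetical tridiagonal precision matrix, chosen only for illustration, and inverts it with a small Gauss-Jordan routine (`invert` is a helper written for this example). The entry $\Lambda_{13}$ is a structural zero, yet the corresponding covariance $\Sigma_{13}$ is nonzero: nodes 1 and 3 are conditionally independent given node 2, but not marginally independent.

```python
def invert(A):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    # Augment A with the identity matrix.
    M = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]


# Hypothetical precision matrix for the chain 1 - 2 - 3 (no edge between 1 and 3).
Lam = [[2.0, -1.0, 0.0],
       [-1.0, 2.0, -1.0],
       [0.0, -1.0, 2.0]]
Sigma = invert(Lam)

print("Lambda[1,3] =", Lam[0][2])   # structural zero
print("Sigma[1,3]  =", Sigma[0][2])  # nonzero marginal covariance
```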


Comparing Gaussian DGMs and UGMs * 


In Section 10.2.5, we saw that directed GGMs correspond to sparse regression matrices, and hence
sparse Cholesky factorizations of covariance matrices, whereas undirected GGMs correspond to
sparse precision matrices. The advantage of the DAG formulation is that we can make the
regression weights $\mathbf{W}$, and hence $\boldsymbol{\Sigma}$, be conditional on covariate information (Pourahmadi 2004),
without worrying about positive definite constraints. The disadvantage of the DAG formulation
is its dependence on the order, although in certain domains, such as time series, there is already
a natural ordering of the variables.


6. An influential paper (Geman and Geman 1984), which introduced the idea of a Gibbs sampler (Section 24.2), proposed
using the Potts model as a prior for image segmentation, but the results in their paper are misleading because they did
not run their Gibbs sampler for long enough. See Figure 24.10 for a vivid illustration of this point.


Figure 19.10 A VAR(2) process represented as a dynamic chain graph. Source: (Dahlhaus and Eichler
2000). Used with kind permission of Rainer Dahlhaus and Oxford University Press.

It is actually possible to combine both representations, resulting in a Gaussian chain graph.
For example, consider a discrete-time, second-order Markov chain in which the states are
continuous, $\mathbf{y}_t \in \mathbb{R}^D$. The transition function can be represented as a (vector-valued)
linear-Gaussian CPD:

$$p(\mathbf{y}_t \mid \mathbf{y}_{t-1}, \mathbf{y}_{t-2}, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{y}_t \mid \mathbf{A}_1 \mathbf{y}_{t-1} + \mathbf{A}_2 \mathbf{y}_{t-2}, \boldsymbol{\Sigma}) \tag{19.29}$$

This is called a vector auto-regressive or VAR process of order 2. Such models are widely used
in econometrics for time-series forecasting.
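
A VAR(2) transition is simple to simulate. The sketch below is only illustrative: it uses a hypothetical 2-dimensional process with made-up transition matrices, and it assumes a diagonal noise covariance so that each component can be sampled independently (a full $\boldsymbol{\Sigma}$ would require a Cholesky factor). The helper names `matvec`, `var2_mean` and `simulate_var2` are not from the text.

```python
import random


def matvec(A, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def var2_mean(A1, A2, y_prev1, y_prev2):
    """Conditional mean A1 y_{t-1} + A2 y_{t-2} of the linear-Gaussian CPD."""
    return [a + b for a, b in zip(matvec(A1, y_prev1), matvec(A2, y_prev2))]


def simulate_var2(A1, A2, noise_std, y0, y1, T, rng):
    """Simulate T further steps, assuming independent noise per component."""
    ys = [list(y0), list(y1)]
    for _ in range(T):
        mean = var2_mean(A1, A2, ys[-1], ys[-2])
        ys.append([m + rng.gauss(0.0, s) for m, s in zip(mean, noise_std)])
    return ys


# Hypothetical parameters, chosen only for illustration.
A1 = [[0.5, 0.0], [0.2, 0.3]]
A2 = [[0.0, -0.1], [0.0, 0.0]]
mean = var2_mean(A1, A2, [1.0, 2.0], [3.0, 4.0])
path = simulate_var2(A1, A2, [0.1, 0.1], [0.0, 0.0], [1.0, 1.0], 50, random.Random(0))
print(mean, len(path))
```

A zero entry such as `A2[1][0]` means there is no directed arc from that component of $\mathbf{y}_{t-2}$ into the corresponding component of $\mathbf{y}_t$.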

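The time series aspect is most naturally modeled using a DGM. However, if $\boldsymbol{\Sigma}^{-1}$ is sparse,
then the correlation amongst the components within a time slice is most naturally modeled
using a UGM. For example, suppose we have

$$\mathbf{A}_1 = \begin{pmatrix} \frac{3}{5} & 0 & \frac{1}{5} & 0 & 0 \\ 0 & \frac{3}{5} & 0 & -\frac{1}{5} & 0 \\ \frac{2}{5} & \frac{1}{3} & \frac{3}{5} & 0 & 0 \\ 0 & 0 & 0 & -\frac{1}{2} & \frac{1}{5} \\ 0 & 0 & \frac{1}{5} & 0 & \frac{2}{5} \end{pmatrix}, \quad \mathbf{A}_2 = \begin{pmatrix} 0 & 0 & -\frac{1}{5} & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & \frac{1}{5} & 0 & \frac{1}{3} \\ 0 & 0 & 0 & 0 & -\frac{1}{5} \end{pmatrix} \tag{19.30}$$

and

$$\boldsymbol{\Sigma} = \begin{pmatrix} 1 & \frac{1}{2} & \frac{1}{3} & 0 & 0 \\ \frac{1}{2} & 1 & -\frac{1}{3} & 0 & 0 \\ \frac{1}{3} & -\frac{1}{3} & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \quad \boldsymbol{\Sigma}^{-1} = \begin{pmatrix} 2.13 & -1.47 & -1.2 & 0 & 0 \\ -1.47 & 2.13 & 1.2 & 0 & 0 \\ -1.2 & 1.2 & 1.8 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix} \tag{19.31}$$


Figure 19.11 (a) A bi-directed graph. (b) The equivalent DAG. Here the w nodes are latent confounders.
Based on Figures 5.12-5.13 of (Choi 2011). Used with kind permission of Myung Choi.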


The resulting graphical model is illustrated in Figure 19.10. Zeros in the transition matrices $\mathbf{A}_1$
and $\mathbf{A}_2$ correspond to absent directed arcs from $\mathbf{y}_{t-1}$ and $\mathbf{y}_{t-2}$ into $\mathbf{y}_t$. Zeros in the precision
matrix $\boldsymbol{\Sigma}^{-1}$ correspond to absent undirected arcs between nodes in $\mathbf{y}_t$.

Sometimes we have a sparse covariance matrix rather than a sparse precision matrix. This can
be represented using a bi-directed graph, where each edge has arrows in both directions, as in
Figure 19.11(a). Here nodes that are not connected are unconditionally independent. For example,
in Figure 19.11(a) we see that $Y_1 \perp Y_3$. In the Gaussian case, this means $\Sigma_{1,3} = \Sigma_{3,1} = 0$. (A
graph representing a sparse covariance matrix is called a covariance graph.) By contrast, if
this were an undirected model, we would have that $Y_1 \perp Y_3 \mid Y_2$, and $\Lambda_{1,3} = \Lambda_{3,1} = 0$, where
$\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$.

A bidirected graph can be converted to a DAG with latent variables, where each bidirected 
edge is replaced with a hidden variable representing a hidden common cause, or confounder, 
as illustrated in Figure 19.11(b). The relevant CI properties can then be determined using d-separation.

We can combine bidirected and directed edges to get a directed mixed graphical model. 
This is useful for representing a variety of models, such as ARMA models (Section 18.2.4.4), 
structural equation models (Section 26.5.5), etc. 


Markov logic networks * 


In Section 10.2.2, we saw how we could “unroll” Markov models and HMMs for an arbitrary 
number of time steps in order to model variable-length sequences. Similarly, in Section 19.4.1, 
we saw how we could expand a lattice UGM to model images of any size. What about more 
complex domains, where we have a variable number of objects and relationships between them? 
Creating models for such scenarios is often done using first-order logic (see e.g., (Russell and 
Norvig 2010)). For example, consider the sentences “Smoking causes cancer” and “If two people 
are friends, and one smokes, then so does the other”. We can write these sentences in first-order
logic as follows:

$$\forall x.\; Sm(x) \Rightarrow Ca(x) \tag{19.32}$$

$$\forall x. \forall y.\; Fr(x, y) \wedge Sm(x) \Rightarrow Sm(y) \tag{19.33}$$

where $Sm$ and $Ca$ are predicates, and $Fr$ is a relation.⁷


Figure 19.12 An example of a ground Markov logic network represented as a pairwise MRF for 2 people.
Based on Figure 2.1 from (Domingos and Lowd 2009). Used with kind permission of Pedro Domingos.

Of course, such rules are not always true. Indeed, this brittleness is the main reason why 
logical approaches to AI are no longer widely used, at least not in their pure form. There 
have been a variety of attempts to combine first order logic with probability theory, an area 
known as statistical relational AI or probabilistic relational modeling (Kersting et al. 2011). 
One simple approach is to take logical rules and attach weights (known as certainty factors) to 
them, and then to interpret them as conditional probability distributions. For example, we might 
say $p(Ca(x) = 1 \mid Sm(x) = 1) = 0.9$. Unfortunately, the rule does not say what to predict if
$Sm(x) = 0$. Furthermore, combining CPDs in this way is not guaranteed to define a consistent
joint distribution, because the resulting graph may not be a DAG. 

An alternative approach is to treat these rules as a way of defining potential functions in an 
unrolled UGM. The result is known as a Markov logic network (Domingos and Lowd 2009). 
To specify the network, we first rewrite all the rules in conjunctive normal form (CNF), also 
known as clausal form. In this case, we get 


$$\neg Sm(x) \vee Ca(x) \tag{19.34}$$

$$\neg Fr(x, y) \vee \neg Sm(x) \vee Sm(y) \tag{19.35}$$

The first clause can be read as “Either x does not smoke or he has cancer”, which is logically
equivalent to Equation 19.32. (Note that in a clause, any unbound variable, such as $x$, is assumed
to be universally quantified.)


7. A predicate is just a function of one argument, known as an object, that evaluates to true or false, depending on 
whether the property holds or not of that object. A (logical) relation is just a function of two or more arguments (objects) 
that evaluates to true or false, depending on whether the relationship holds between that set of objects or not.

Inference in first-order logic is only semi-decidable, so it is common to use a restricted subset.
A common approach (as used in Prolog) is to restrict the language to Horn clauses, which are 
clauses that contain at most one positive literal. Essentially this means the model is a series of 
if-then rules, where the right hand side of the rules (the “then” part, or consequence) has only 
a single term. 

Once we have encoded our knowledge base as a set of clauses, we can attach weights to
each one; these weights are the parameters of the model, and they define the clique potentials
as follows:

$$\psi_c(\mathbf{x}_c) = \exp(w_c \phi_c(\mathbf{x}_c)) \tag{19.36}$$

where $\phi_c(\mathbf{x}_c)$ is a logical expression which evaluates clause $c$ applied to the variables $\mathbf{x}_c$, and
$w_c$ is the weight we attach to this clause. Roughly speaking, the weight of a clause specifies
the probability of a world in which this clause is satisfied relative to a world in which it is not
satisfied.

Now suppose there are two objects (people) in the world, Anna and Bob, which we will denote 
by constant symbols A and B. We can make a ground network from the above clauses by 
creating binary random variables $S_x$, $C_x$, and $F_{x,y}$ for $x, y \in \{A, B\}$, and then “wiring these
up” according to the clauses above. The result is the UGM in Figure 19.12 with 8 binary nodes.
Note that we have not encoded the fact that $Fr$ is a symmetric relation, so $Fr(A, B)$ and
$Fr(B, A)$ might have different values. Similarly, we have the “degenerate” nodes $Fr(A, A)$ and
$Fr(B, B)$, since we did not enforce $x \neq y$ in Equation 19.33. (If we add such constraints,
then the model compiler, which generates the ground network, could avoid creating redundant 
nodes.) 
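
The grounding step can be sketched in a few lines of Python. The code below builds the 8 ground atoms for Anna and Bob, scores each of the $2^8$ possible worlds by summing the weights of the satisfied ground clauses, and answers a query by brute-force enumeration. The two clause weights are hypothetical values picked for illustration, and the helper names (`ground_atoms`, `log_potential`, `conditional`) are not from any MLN package.

```python
import itertools
import math

PEOPLE = ["A", "B"]
W_SMOKE_CANCER = 1.5  # hypothetical weight for the clause  not Sm(x) or Ca(x)
W_FRIENDS = 1.1       # hypothetical weight for  not Fr(x,y) or not Sm(x) or Sm(y)


def ground_atoms():
    atoms = [("Sm", x) for x in PEOPLE] + [("Ca", x) for x in PEOPLE]
    atoms += [("Fr", x, y) for x in PEOPLE for y in PEOPLE]
    return atoms


def log_potential(world):
    """Sum of weights of all satisfied ground clauses in this world."""
    total = 0.0
    for x in PEOPLE:
        total += W_SMOKE_CANCER * ((not world[("Sm", x)]) or world[("Ca", x)])
        for y in PEOPLE:
            satisfied = ((not world[("Fr", x, y)]) or (not world[("Sm", x)])
                         or world[("Sm", y)])
            total += W_FRIENDS * satisfied
    return total


atoms = ground_atoms()
worlds = [dict(zip(atoms, vals))
          for vals in itertools.product([False, True], repeat=len(atoms))]


def conditional(query, evidence):
    """P(query atom is true | evidence), by enumerating all worlds."""
    num = den = 0.0
    for w in worlds:
        if all(w[a] == v for a, v in evidence.items()):
            p = math.exp(log_potential(w))
            den += p
            if w[query]:
                num += p
    return num / den


p_smoker = conditional(("Ca", "A"), {("Sm", "A"): True})
p_nonsmoker = conditional(("Ca", "A"), {("Sm", "A"): False})
print(len(atoms), p_smoker, p_nonsmoker)
```

Because $Ca(A)$ appears in only one ground clause, the first query reduces to $\mathrm{sigm}(w)$ for that clause weight, while the second is exactly 0.5: the rule says nothing about non-smokers. Enumeration is only feasible for tiny domains such as this one.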

In summary, we can think of MLNs as a convenient way of specifying a UGM template, that 
can get unrolled to handle data of arbitrary size. There are several other ways to define relational 
probabilistic models; see e.g., (Koller and Friedman 2009; Kersting et al. 2011) for details. In some 
cases, there is uncertainty about the number or existence of objects or relations (the so-called 
open universe problem). Section 18.6.2 gives a concrete example in the context of multi-object 
tracking. See e.g., (Russell and Norvig 2010; Kersting et al. 2011) and references therein for further 
details. 


Learning 
In this section, we discuss how to perform ML and MAP parameter estimation for MRFs. We will
see that this is quite computationally expensive. For this reason, it is rare to perform Bayesian
inference for the parameters of MRFs (although see (Qi et al. 2005)).


Training maxent models using gradient methods 


Consider an MRF in log-linear form: 


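$$p(\mathbf{y} \mid \boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \exp\left(\sum_c \boldsymbol{\theta}_c^T \boldsymbol{\phi}_c(\mathbf{y})\right) \tag{19.37}$$

where $c$ indexes the cliques. The scaled log-likelihood is given by

$$\ell(\boldsymbol{\theta}) \triangleq \frac{1}{N} \sum_i \log p(\mathbf{y}_i \mid \boldsymbol{\theta}) = \frac{1}{N} \sum_i \left[\sum_c \boldsymbol{\theta}_c^T \boldsymbol{\phi}_c(\mathbf{y}_i) - \log Z(\boldsymbol{\theta})\right] \tag{19.38}$$

Since MRFs are in the exponential family, we know that this function is concave in $\boldsymbol{\theta}$ (see
Section 9.2.3), so it has a unique global maximum which we can find using gradient-based
optimizers. In particular, the derivative for the weights of a particular clique, $c$, is given by

$$\frac{\partial \ell}{\partial \boldsymbol{\theta}_c} = \frac{1}{N} \sum_i \left[\boldsymbol{\phi}_c(\mathbf{y}_i) - \frac{\partial}{\partial \boldsymbol{\theta}_c} \log Z(\boldsymbol{\theta})\right] \tag{19.39}$$

Exercise 19.1 asks you to show that the derivative of the log partition function wrt $\boldsymbol{\theta}_c$ is the
expectation of the $c$’th feature under the model, i.e.,

$$\frac{\partial \log Z(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}_c} = \mathbb{E}[\boldsymbol{\phi}_c(\mathbf{y}) \mid \boldsymbol{\theta}] = \sum_{\mathbf{y}} \boldsymbol{\phi}_c(\mathbf{y}) p(\mathbf{y} \mid \boldsymbol{\theta}) \tag{19.40}$$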


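Hence the gradient of the log likelihood is

$$\frac{\partial \ell}{\partial \boldsymbol{\theta}_c} = \left[\frac{1}{N} \sum_i \boldsymbol{\phi}_c(\mathbf{y}_i)\right] - \mathbb{E}[\boldsymbol{\phi}_c(\mathbf{y})] \tag{19.41}$$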


In the first term, we fix y to its observed values; this is sometimes called the clamped term. 
In the second term, y is free; this is sometimes called the unclamped term or contrastive 
term. Note that computing the unclamped term requires inference in the model, and this must 
be done once per gradient step. This makes UGM training much slower than DGM training. 
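
For a model small enough to enumerate, the clamped and unclamped terms can both be computed exactly, which makes the gradient easy to inspect. The following Python sketch fits a 3-node binary chain by plain gradient ascent. The feature set (node values plus edge agreement indicators), the toy data set, the step size and the iteration count are all assumptions made for the example, and the exhaustive enumeration in `model_expectation` stands in for the inference step that would be needed in a larger model.

```python
import itertools
import math

STATES = list(itertools.product([0, 1], repeat=3))  # all joint states of the chain


def features(y):
    """Node features y_s plus edge agreement indicators (illustrative choice)."""
    y1, y2, y3 = y
    return [y1, y2, y3, float(y1 == y2), float(y2 == y3)]


def mean_features(weighted_states):
    """Expected feature vector under a list of (state, probability) pairs."""
    total = [0.0] * 5
    for y, p in weighted_states:
        for c, f in enumerate(features(y)):
            total[c] += p * f
    return total


def model_expectation(theta):
    """Unclamped term: exact E[phi(y) | theta] by enumerating all states."""
    scores = [math.exp(sum(t * f for t, f in zip(theta, features(y))))
              for y in STATES]
    Z = sum(scores)
    return mean_features([(y, s / Z) for y, s in zip(STATES, scores)])


def fit(data, step=0.5, iters=2000):
    """Gradient ascent on the scaled log-likelihood."""
    clamped = mean_features([(y, 1.0 / len(data)) for y in data])  # clamped term
    theta = [0.0] * 5
    for _ in range(iters):
        unclamped = model_expectation(theta)  # inference, once per step
        theta = [t + step * (c - u) for t, c, u in zip(theta, clamped, unclamped)]
    return theta, clamped


# Toy data: every state once, plus extra copies of a few "smooth" states.
data = STATES + [(0, 0, 0)] * 3 + [(1, 1, 1)] * 3 + [(1, 1, 0)] * 2
theta_hat, clamped = fit(data)
gap = max(abs(c - u) for c, u in zip(clamped, model_expectation(theta_hat)))
print("largest moment mismatch:", gap)
```

At convergence the gradient vanishes, so the printed mismatch between the empirical and model feature expectations should be close to zero.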

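The gradient of the log likelihood can be rewritten as the expected feature vector according
to the empirical distribution minus the model’s expectation of the feature vector:

$$\frac{\partial \ell}{\partial \boldsymbol{\theta}_c} = \mathbb{E}_{p_{\text{emp}}}[\boldsymbol{\phi}_c(\mathbf{y})] - \mathbb{E}_{p(\cdot \mid \boldsymbol{\theta})}[\boldsymbol{\phi}_c(\mathbf{y})] \tag{19.42}$$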


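At the optimum, the gradient will be zero, so the empirical distribution of the features will
match the model’s predictions:

$$\mathbb{E}_{p_{\text{emp}}}[\boldsymbol{\phi}_c(\mathbf{y})] = \mathbb{E}_{p(\cdot \mid \boldsymbol{\theta})}[\boldsymbol{\phi}_c(\mathbf{y})] \tag{19.43}$$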


This is called moment matching. This observation motivates a different optimization algorithm 
which we discuss in Section 19.5.7. 


Training partially observed maxent models 


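Suppose we have missing data and/or hidden variables in our model. In general, we can
represent such models as follows:

$$p(\mathbf{y}, \mathbf{h} \mid \boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \exp\left(\sum_c \boldsymbol{\theta}_c^T \boldsymbol{\phi}_c(\mathbf{h}, \mathbf{y})\right) \tag{19.44}$$

The log likelihood has the form

$$\ell(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^N \log\left(\sum_{\mathbf{h}_i} p(\mathbf{y}_i, \mathbf{h}_i \mid \boldsymbol{\theta})\right) = \frac{1}{N} \sum_{i=1}^N \log\left(\frac{1}{Z(\boldsymbol{\theta})} \sum_{\mathbf{h}_i} \tilde{p}(\mathbf{y}_i, \mathbf{h}_i \mid \boldsymbol{\theta})\right) \tag{19.45}$$

where

$$\tilde{p}(\mathbf{y}, \mathbf{h} \mid \boldsymbol{\theta}) \triangleq \exp\left(\sum_c \boldsymbol{\theta}_c^T \boldsymbol{\phi}_c(\mathbf{h}, \mathbf{y})\right) \tag{19.46}$$

is the unnormalized distribution. The term $\sum_{\mathbf{h}_i} \tilde{p}(\mathbf{y}_i, \mathbf{h}_i \mid \boldsymbol{\theta})$ is the same as the partition function
for the whole model, except that $\mathbf{y}$ is fixed at $\mathbf{y}_i$. Hence the gradient is just the expected features
where we clamp $\mathbf{y}_i$, but average over $\mathbf{h}$:

$$\frac{\partial}{\partial \boldsymbol{\theta}_c} \log\left(\sum_{\mathbf{h}_i} \tilde{p}(\mathbf{y}_i, \mathbf{h}_i \mid \boldsymbol{\theta})\right) = \mathbb{E}[\boldsymbol{\phi}_c(\mathbf{h}, \mathbf{y}_i) \mid \boldsymbol{\theta}] \tag{19.47}$$

So the overall gradient is given by

$$\frac{\partial \ell}{\partial \boldsymbol{\theta}_c} = \frac{1}{N} \sum_i \left\{\mathbb{E}[\boldsymbol{\phi}_c(\mathbf{h}, \mathbf{y}_i) \mid \boldsymbol{\theta}] - \mathbb{E}[\boldsymbol{\phi}_c(\mathbf{h}, \mathbf{y}) \mid \boldsymbol{\theta}]\right\} \tag{19.48}$$

The first set of expectations are computed by “clamping” the visible nodes to their observed
values, and the second set are computed by letting the visible nodes be free. In both cases, we
marginalize over $\mathbf{h}_i$.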

An alternative approach is to use generalized EM, where we use gradient methods in the M 
step. See (Koller and Friedman 2009, p956) for details. 


Approximate methods for computing the MLEs of MRFs 


When fitting a UGM there is (in general) no closed form solution for the ML or the MAP estimate 
of the parameters, so we need to use gradient-based optimizers. This gradient requires inference. 
In models where inference is intractable, learning also becomes intractable. This has motivated 
various computationally faster alternatives to ML/MAP estimation, which we list in Table 19.1. We 
discuss some of these alternatives below, and defer others to later sections.


Pseudo likelihood 


One alternative to MLE is to maximize the pseudo likelihood (Besag 1975), defined as follows:

$$\ell_{PL}(\boldsymbol{\theta}) \triangleq \sum_{\mathbf{y}} \sum_{d=1}^D p_{\text{emp}}(\mathbf{y}) \log p(y_d \mid \mathbf{y}_{-d}) = \frac{1}{N} \sum_{i=1}^N \sum_{d=1}^D \log p(y_{id} \mid \mathbf{y}_{i,-d}, \boldsymbol{\theta}) \tag{19.49}$$

That is, we optimize the product of the full conditionals, also known as the composite likelihood
(Lindsay 1988). Compare this to the objective for maximum likelihood:

$$\ell_{ML}(\boldsymbol{\theta}) = \sum_{\mathbf{y}} p_{\text{emp}}(\mathbf{y}) \log p(\mathbf{y} \mid \boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^N \log p(\mathbf{y}_i \mid \boldsymbol{\theta}) \tag{19.50}$$


Method                        Restriction                      Exact MLE?               Section
Closed form                   Only Chordal MRF                 Exact                    Section 19.5.7.4
IPF                           Only Tabular / Gaussian MRF      Exact                    Section 19.5.7
Gradient-based optimization   Low tree width                   Exact                    Section 19.5.1
Max-margin training           Only CRFs                        N/A                      Section 19.7
Pseudo-likelihood             No hidden variables              Approximate              Section 19.5.4
Stochastic ML                 -                                Exact (up to MC error)   Section 19.5.5
Contrastive divergence        -                                Approximate              Section 27.7.2.4
Minimum probability flow      Can integrate out the hiddens    Approximate              Sohl-Dickstein et al. (2011)

Table 19.1 Some methods that can be used to compute approximate ML/MAP parameter estimates for
MRFs/CRFs. Low tree-width means that, in order for the method to be efficient, the graph must be “tree-like”;
see Section 20.5 for details.




Figure 19.13 (a) A small 2d lattice. (b) The representation used by pseudo likelihood. Solid nodes are 
observed neighbors. Based on Figure 2.2 of (Carbonetto 2003). 


In the case of Gaussian MRFs, PL is equivalent to ML (Besag 1975), but this is not true in general 
(Liang and Jordan 2008). 

The PL approach is illustrated in Figure 19.13 for a 2d grid. We learn to predict each node,
given all of its neighbors. This objective is generally fast to compute since each full conditional
$p(y_{id} \mid \mathbf{y}_{i,-d}, \boldsymbol{\theta})$ only requires summing over the states of a single node, $y_{id}$, in order to compute
the local normalization constant. The PL approach is similar to fitting each full conditional
separately (which is the method used to train dependency networks, discussed in Section 26.2.2),
except that the parameters are tied between adjacent nodes.

One problem with PL is that it is hard to apply to models with hidden variables (Parise and
Welling 2005). Another more subtle problem is that each node assumes that its neighbors have
known values. If node $t \in \mathrm{nbr}(s)$ is a perfect predictor for node $s$, then $s$ will learn to rely
completely on node $t$, even at the expense of ignoring other potentially useful information, such
as its local evidence.

However, experiments in (Parise and Welling 2005; Hoefling and Tibshirani 2009) suggest that 
PL works as well as exact ML for fully observed Ising models, and of course PL is much faster. 
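
To make the computational saving concrete, here is a small Python sketch of the pseudo likelihood for an Ising model on a 2 × 2 lattice. It assumes states in $\{-1, +1\}$ and an unnormalized log probability $J \sum_{s \sim t} y_s y_t + b \sum_s y_s$ with a single shared coupling $J$ and bias $b$. This parameterization and all function names are assumptions for the example. Each full conditional is a sigmoid of the local field, so no global partition function is needed; the second function recomputes the same quantity from ratios of unnormalized joint probabilities as a check.

```python
import math

EDGES = [(0, 1), (2, 3), (0, 2), (1, 3)]  # 2 x 2 lattice, nodes numbered row-wise
NBRS = {s: [] for s in range(4)}
for s, t in EDGES:
    NBRS[s].append(t)
    NBRS[t].append(s)


def log_sigmoid(a):
    """Numerically stable log(1 / (1 + exp(-a)))."""
    return -math.log1p(math.exp(-a)) if a >= 0 else a - math.log1p(math.exp(a))


def log_pseudo_likelihood(data, J, b):
    """Average over cases of sum_d log p(y_d | y_{-d}), using only local fields."""
    total = 0.0
    for y in data:
        for d in range(len(y)):
            field = J * sum(y[t] for t in NBRS[d]) + b
            total += log_sigmoid(2 * y[d] * field)
    return total / len(data)


def log_unnorm(y, J, b):
    return J * sum(y[s] * y[t] for s, t in EDGES) + b * sum(y)


def log_pl_brute_force(data, J, b):
    """Same objective, from p~(y) / (p~(y) + p~(y with node d flipped))."""
    total = 0.0
    for y in data:
        for d in range(len(y)):
            flipped = list(y)
            flipped[d] = -y[d]
            num = log_unnorm(y, J, b)
            den = math.log(math.exp(num) + math.exp(log_unnorm(flipped, J, b)))
            total += num - den
    return total / len(data)


data = [(1, 1, 1, 1), (1, -1, 1, -1), (-1, -1, 1, 1)]
fast = log_pseudo_likelihood(data, J=0.7, b=-0.2)
slow = log_pl_brute_force(data, J=0.7, b=-0.2)
print(fast, slow)
```

The local normalization involves only the two states of one node, whereas the maximum likelihood objective would require summing over all $2^D$ joint states.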


Stochastic maximum likelihood 


Recall that the gradient of the log-likelihood for a fully observed MRF is given by

$$\nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}) = \frac{1}{N} \sum_i \left[\boldsymbol{\phi}(\mathbf{y}_i) - \mathbb{E}[\boldsymbol{\phi}(\mathbf{y})]\right] \tag{19.51}$$

The gradient for a partially observed MRF is similar. In both cases, we can approximate the
model expectations using Monte Carlo sampling. We can combine this with stochastic gradient
descent (Section 8.5.2), which takes samples from the empirical distribution. Pseudocode for the
resulting method is shown in Algorithm 19.1.


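Algorithm 19.1: Stochastic maximum likelihood for fitting an MRF
Initialize weights $\boldsymbol{\theta}$ randomly;
$k = 0$, $\eta = 1$;
for each epoch do
    for each minibatch of size $B$ do
        for each sample $s = 1 : S$ do
            Sample $\mathbf{y}^{s,k} \sim p(\mathbf{y} \mid \boldsymbol{\theta}_k)$;
        $\hat{E}(\boldsymbol{\phi}(\mathbf{y})) = \frac{1}{S} \sum_{s=1}^S \boldsymbol{\phi}(\mathbf{y}^{s,k})$;
        for each training case $i$ in minibatch do
            $\mathbf{g}_{ik} = \boldsymbol{\phi}(\mathbf{y}_i) - \hat{E}(\boldsymbol{\phi}(\mathbf{y}))$;
        $\mathbf{g}_k = \frac{1}{B} \sum_{i \in B} \mathbf{g}_{ik}$;
        $\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta \mathbf{g}_k$;
        $k = k + 1$;
        Decrease step size $\eta$;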


Typically we use MCMC to generate the samples. Of course, running MCMC to convergence
at each step of the inner loop would be extremely slow. Fortunately, it was shown in (Younes
1989) that we can start the MCMC chain at its previous value, and just take a few steps. In
other words, we sample $\mathbf{y}^{s,k}$ by initializing the MCMC chain at $\mathbf{y}^{s,k-1}$, and then run for a few
iterations. This is valid since $p(\mathbf{y} \mid \boldsymbol{\theta}^k)$ is likely to be close to $p(\mathbf{y} \mid \boldsymbol{\theta}^{k-1})$, since we only changed
the parameters a small amount. We call this algorithm stochastic maximum likelihood or
SML. (There is a closely related algorithm called persistent contrastive divergence which we
discuss in Section 27.7.2.5.)
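
The following Python sketch shows one way the SML loop might look for a very small model: an open chain of 6 spins in $\{-1, +1\}$ with a single coupling parameter $\theta$ and feature $\phi(\mathbf{y}) = \sum_t y_t y_{t+1}$. The model, the toy data, the number of persistent chains and the step size schedule are all assumptions chosen to keep the example short. The key point is that each Gibbs chain is continued from its previous state for one sweep rather than restarted.

```python
import math
import random

D = 6  # chain length


def phi(y):
    """Single feature: sum of products of neighbouring spins."""
    return sum(y[t] * y[t + 1] for t in range(D - 1))


def gibbs_sweep(y, theta, rng):
    """One in-place Gibbs sweep over all spins of an open chain."""
    for t in range(D):
        field = (y[t - 1] if t > 0 else 0) + (y[t + 1] if t < D - 1 else 0)
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * theta * field))
        y[t] = 1 if rng.random() < p_plus else -1


def sml(data, epochs=100, batch_size=2, num_chains=20, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    # Persistent chains: initialised once, then reused across updates.
    chains = [[rng.choice((-1, 1)) for _ in range(D)] for _ in range(num_chains)]
    k = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            for y in chains:
                gibbs_sweep(y, theta, rng)  # a few steps from the previous state
            model_mean = sum(phi(y) for y in chains) / num_chains
            grad = sum(phi(y) - model_mean for y in batch) / len(batch)
            eta = 0.1 / (1.0 + 0.01 * k)  # decreasing step size (assumed schedule)
            theta += eta * grad           # ascent on the log-likelihood
            k += 1
    return theta


data = [[1] * 6, [-1] * 6, [1, 1, 1, -1, -1, -1], [-1, -1, 1, 1, 1, 1]]
theta_hat = sml(data)
print("estimated coupling:", theta_hat)
```

For this particular chain the bonds are independent under the model, so the exact MLE satisfies $\tanh\theta = 0.8$; the stochastic estimate should fluctuate around that value rather than match it exactly.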


Feature induction for maxent models * 


MRFs require a good set of features. One unsupervised way to learn such features, known as 
feature induction, is to start with a base set of features, and then to continually create new 
feature combinations out of old ones, greedily adding the best ones to the model. This approach 
was first proposed in (Pietra et al. 1997; Zhu et al. 1997), and was later extended to the CRF case 
in (McCallum 2003). 

To illustrate the basic idea, we present an example from (Pietra et al. 1997), which described 
how to build unconditional probabilistic models to represent English spelling. Initially the model 
has no features, which represents the uniform distribution. The algorithm starts by choosing to 
add the feature 


$$\phi_1(\mathbf{y}) = \sum_t \mathbb{I}(y_t \in \{a, \ldots, z\}) \tag{19.52}$$

which checks if any letter is lower case or not. After the feature is added, the parameters are
(re)-fit by maximum likelihood. For this feature, it turns out that $\hat{\theta}_1 = 1.944$, which means that
a word with a lowercase letter in any position is about $e^{1.944} \approx 7$ times more likely than the
same word without a lowercase letter in that position. Some samples from this model, generated
using (annealed) Gibbs sampling (Section 24.2), are shown below.⁸


hzagh, yzop, io, advzmxnv, ijv_bolft, x, emx, kayerf, mlj, rawzyb, jp, ag, 
ctdnnnbg, wgdw, t, kguv, cy, spxcq, uzflbbf, dxtkkn, cxwx, jpd, ztzh, lv, 
zhpkvnu, 1^, r, qee, nynrx, atze4n, ik, se, w, lrh, hpt, yrqyka’h, zcngotcnx, 
igcump, zjcjs, lqpWiqu, cefmfhc, o, lb, fdcY, tzby, yopxmvk, by, fz,, t, govyccm, 
ijyiduwfzo, 6xr, duh, ejv, pk, pjw, 1, fl, w 


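The second feature added by the algorithm checks if two adjacent characters are lower case:

$$\phi_2(\mathbf{y}) = \sum_{s \sim t} \mathbb{I}(y_s \in \{a, \ldots, z\}, y_t \in \{a, \ldots, z\}) \tag{19.53}$$

Now the model has the form

$$p(\mathbf{y}) = \frac{1}{Z} \exp(\theta_1 \phi_1(\mathbf{y}) + \theta_2 \phi_2(\mathbf{y})) \tag{19.54}$$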
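
The two features are easy to compute for any string. The short Python sketch below evaluates them and the resulting unnormalized log score. The value of $\theta_1$ is the one quoted above, whereas `THETA2` is a made-up placeholder, since the fitted value of the second weight is not given here. The function names are likewise only illustrative.

```python
import math
import string

LOWER = set(string.ascii_lowercase)

THETA1 = 1.944  # value quoted in the text for the first feature
THETA2 = 0.5    # hypothetical placeholder; the fitted value is not given


def phi1(word):
    """Number of lower-case letters in the word."""
    return sum(ch in LOWER for ch in word)


def phi2(word):
    """Number of adjacent character pairs that are both lower case."""
    return sum(a in LOWER and b in LOWER for a, b in zip(word, word[1:]))


def log_score(word):
    """Unnormalized log probability theta_1 phi_1 + theta_2 phi_2."""
    return THETA1 * phi1(word) + THETA2 * phi2(word)


for w in ["fdcY", "lqpWiqu"]:
    print(w, phi1(w), phi2(w), log_score(w))
print("odds per lower-case letter:", math.exp(THETA1))
```

Computing the normalizer $Z$ would require summing over all strings, which is why samples are drawn with Gibbs sampling rather than by direct enumeration.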


Continuing in this way, the algorithm adds features for the strings s> and ing>, where > 
represents the end of word, and for various regular expressions such as [0-9], etc. Some 
samples from the model with 1000 features, generated using (annealed) Gibbs sampling, are 
shown below. 


was, reaser, in, there, to, will, ,, was, by, homes, thing, be, reloverated, 
ther, which, conists, at, fores, anditing, with, Mr., proveral, the, ,, ***, 
on’t, prolling, prothere, ,, mento, at, yaou, 1, chestraing, for, have, to, 
intrally, of, qut, ., best, compers, ***, cluseliment, uster, of, is, deveral, 
this, thise, of, offect, inatever, thifer, constranded, stater, vill, in, thase, 
in, youse, menttering, and, ., of, in, verate, of, to 


This approach of feature learning can be thought of as a form of graphical model structure 
learning (Chapter 26), except it is more fine-grained: we add features that are useful, regardless 
of the resulting graph structure. However, the resulting graphs can become densely connected, 
which makes inference (and hence parameter estimation) intractable. 


Iterative proportional fitting (IPF) * 


Consider a pairwise MRF where the potentials are represented as tables, with one parameter per 
variable setting. We can represent this in log-linear form using 


$$\psi_{st}(y_s, y_t) = \exp\left(\boldsymbol{\theta}_{st}^T [\mathbb{I}(y_s = 1, y_t = 1), \ldots, \mathbb{I}(y_s = K, y_t = K)]\right) \tag{19.55}$$

and similarly for $\psi_t(y_t)$. Thus the feature vectors are just indicator functions.


8. We thank John Lafferty for sharing this example.

From Equation 19.43, we have that, at the maximum of the likelihood, the empirical expectation
of the features equals the model’s expectation:

$$\mathbb{E}_{p_{\text{emp}}}[\mathbb{I}(y_s = j, y_t = k)] = \mathbb{E}_{p(\cdot \mid \boldsymbol{\theta})}[\mathbb{I}(y_s = j, y_t = k)] \tag{19.56}$$

$$p_{\text{emp}}(y_s = j, y_t = k) = p(y_s = j, y_t = k \mid \boldsymbol{\theta}) \tag{19.57}$$

where $p_{\text{emp}}$ is the empirical probability:

$$p_{\text{emp}}(y_s = j, y_t = k) = \frac{N_{st,jk}}{N} = \frac{\sum_{n=1}^N \mathbb{I}(y_{ns} = j, y_{nt} = k)}{N} \tag{19.58}$$


For a general graph, the condition that must hold at the optimum is 


$$p_{\text{emp}}(\mathbf{y}_c) = p(\mathbf{y}_c \mid \boldsymbol{\theta}) \tag{19.59}$$


For a special family of graphs known as decomposable graphs (defined in Section 20.4.1), one
can show that $p(\mathbf{y}_c \mid \boldsymbol{\theta}) = \psi_c(\mathbf{y}_c)$. However, even if the graph is not decomposable, we can
imagine trying to enforce this condition. This suggests an iterative coordinate ascent scheme
where at each step we compute

$$\psi_c^{t+1}(\mathbf{y}_c) = \psi_c^t(\mathbf{y}_c) \times \frac{p_{\text{emp}}(\mathbf{y}_c)}{p(\mathbf{y}_c \mid \boldsymbol{\psi}^t)} \tag{19.60}$$

where the multiplication is elementwise. This is known as iterative proportional fitting or IPF
(Fienberg 1970; Bishop et al. 1975). See Algorithm 19.2 for the pseudocode.


Algorithm 19.2: Iterative Proportional Fitting algorithm for tabular MRFs
Initialize $\psi_c = 1$ for $c = 1 : C$;
repeat
    for $c = 1 : C$ do
        $p_c = p(\mathbf{y}_c \mid \boldsymbol{\psi})$;
        $\hat{p}_c = p_{\text{emp}}(\mathbf{y}_c)$;
        $\psi_c = \psi_c * \frac{\hat{p}_c}{p_c}$;
until converged;


Example 


Let us consider a simple example from http://en.wikipedia.org/wiki/Iterative_proportional_fitting.
We have two binary variables, $Y_1$ and $Y_2$, where $Y_{n1} = 1$ if man $n$ is left
handed, and $Y_{n1} = 0$ otherwise; similarly, $Y_{n2} = 1$ if woman $n$ is left handed, and $Y_{n2} = 0$
otherwise. We can summarize the data using the following 2 × 2 contingency table:

          right-handed   left-handed   Total
male      43             9             52
female    44             4             48
Total     87             13            100

Suppose we want to fit a disconnected graphical model containing nodes $Y_1$ and $Y_2$ but with
no edge between them. That is, we want to find vectors $\boldsymbol{\psi}_1$ and $\boldsymbol{\psi}_2$ such that $\mathbf{M} \triangleq \boldsymbol{\psi}_1 \boldsymbol{\psi}_2^T \approx \mathbf{C}$,
where $\mathbf{M}$ are the model’s expected counts, and $\mathbf{C}$ are the empirical counts. By moment
matching, we find that the row and column sums of the model must exactly match the row
and column sums of the data. One possible solution is to use $\boldsymbol{\psi}_1 = [0.5200, 0.4800]$ and
$\boldsymbol{\psi}_2 = [87, 13]$. Below we show the model's predictions, $\mathbf{M} = \boldsymbol{\psi}_1 \boldsymbol{\psi}_2^T$.

          right-handed   left-handed   Total
male      45.24          6.76          52
female    41.76          6.24          48
Total     87             13            100

It is easy to see that this matches the required constraints. See IPFdemo2x2 for some Matlab
code that computes these numbers. This method is easily generalized to arbitrary graphs.
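
The same numbers can be reproduced with a few lines of Python. This is a sketch in the spirit of Algorithm 19.2, not a translation of IPFdemo2x2: it treats the row and column margins as the two cliques of the disconnected model, starts from a table of ones, and alternately rescales rows and columns so that the model's margins match the empirical ones. The function name `ipf_independence` is an assumption made for this example.

```python
def ipf_independence(row_totals, col_totals, iters=10):
    """Fit expected counts with no row-column interaction by alternate rescaling."""
    R, C = len(row_totals), len(col_totals)
    M = [[1.0] * C for _ in range(R)]  # all potentials initialised to 1
    for _ in range(iters):
        # Match the row margins.
        for i in range(R):
            s = sum(M[i])
            M[i] = [m * row_totals[i] / s for m in M[i]]
        # Match the column margins.
        for j in range(C):
            s = sum(M[i][j] for i in range(R))
            for i in range(R):
                M[i][j] *= col_totals[j] / s
    return M


M = ipf_independence([52, 48], [87, 13])
for row in M:
    print([round(v, 2) for v in row])
```

Since this graph is decomposable, the table already satisfies both sets of constraints after the first pass; the remaining iterations leave it unchanged up to rounding.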


Speed of IPF 


IPF is a fixed point algorithm for enforcing the moment matching constraints and is guaranteed 
to converge to the global optimum (Bishop et al. 1975). The number of iterations depends on the 
form of the model. If the graph is decomposable, then IPF converges in a single iteration, but in 
general, IPF may require many iterations. 

It is clear that the dominant cost of IPF is computing the required marginals under the model. 
Efficient methods, such as the junction tree algorithm (Section 20.4), can be used, resulting in 
something called efficient IPF (Jirousek and Preucil 1995). 

Nevertheless, coordinate descent can be slow. An alternative method is to update all the 
parameters at once, by simply following the gradient of the likelihood. This gradient approach 
has the further significant advantage that it works for models in which the clique potentials may 
not be fully parameterized, i.e., the features may not consist of all possible indicators for each 
clique, but instead can be arbitrary. Although it is possible to adapt IPF to this setting of general 
features, resulting in a method known as iterative scaling, in practice the gradient method is 
much faster (Malouf 2002; Minka 2003). 


Generalizations of IPF 


We can use IPF to fit Gaussian graphical models: instead of working with empirical counts, we 
work with empirical means and covariances (Speed and Kiiveri 1986). It is also possible to create 
a Bayesian IPF algorithm for sampling from the posterior of the model’s parameters (see e.g., 
(Dobra and Massam 2010)). 


IPF for decomposable graphical models 


There is a special family of undirected graphical models known as decomposable graphical 
models. This is formally defined in Section 20.4.1, but the basic idea is that it contains graphs 
which are “tree-like”. Such graphs can be represented by UGMs or DGMs without any loss of
information.

In the case of decomposable graphical models, IPF converges in one iteration. In fact, the
MLE has a closed form solution (Lauritzen 1996). In particular, for tabular potentials we have

$$\hat{\psi}_c(\mathbf{y}_c = k) = \frac{\sum_{i=1}^N \mathbb{I}(\mathbf{y}_{i,c} = k)}{N} \tag{19.61}$$

and for Gaussian potentials, we have

$$\hat{\boldsymbol{\mu}}_c = \frac{\sum_{i=1}^N \mathbf{y}_{ic}}{N}, \quad \hat{\boldsymbol{\Sigma}}_c = \frac{\sum_i (\mathbf{y}_{ic} - \hat{\boldsymbol{\mu}}_c)(\mathbf{y}_{ic} - \hat{\boldsymbol{\mu}}_c)^T}{N} \tag{19.62}$$


By using conjugate priors, we can also easily compute the full posterior over the model
parameters in the decomposable case, just as we did in the DGM case. See (Lauritzen 1996) for
details.


Conditional random fields (CRFs) 


A conditional random field or CRF (Lafferty et al. 2001), sometimes called a discriminative random
field (Kumar and Hebert 2003), is just a version of an MRF where all the clique potentials are
conditioned on input features:

$$p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) = \frac{1}{Z(\mathbf{x}, \mathbf{w})} \prod_c \psi_c(\mathbf{y}_c \mid \mathbf{x}, \mathbf{w}) \tag{19.63}$$

A CRF can be thought of as a structured output extension of logistic regression. We will usually
assume a log-linear representation of the potentials:

$$\psi_c(\mathbf{y}_c \mid \mathbf{x}, \mathbf{w}) = \exp(\mathbf{w}_c^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_c)) \tag{19.64}$$

where $\boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_c)$ is a feature vector derived from the global inputs $\mathbf{x}$ and the local set of labels
$\mathbf{y}_c$. We will give some examples below which will make this notation clearer.

The advantage of a CRF over an MRF is analogous to the advantage of a discriminative 
classifier over a generative classifier (see Section 8.6), namely, we don’t need to “waste resources” 
modeling things that we always observe. Instead we can focus our attention on modeling what 
we care about, namely the distribution of labels given the data. 

Another important advantage of CRFs is that we can make the potentials (or factors) of the 
model be data-dependent. For example, in image processing applications, we may “turn off” the 
label smoothing between two neighboring nodes s and t if there is an observed discontinuity in 
the image intensity between pixels s and t. Similarly, in natural language processing problems, 
we can make the latent labels depend on global properties of the sentence, such as which 
language it is written in. It is hard to incorporate global features into generative models. 

The disadvantage of CRFs compared with MRFs is that they require labeled training data, and they 
are slower to train, as we explain in Section 19.6.3. This is analogous to the strengths and 
weaknesses of logistic regression vs naive Bayes, discussed in Section 8.6. 


Chain-structured CRFs, MEMMs and the label-bias problem 


The most widely used kind of CRF uses a chain-structured graph to model correlation amongst 
neighboring labels. Such models are useful for a variety of sequence labeling tasks (see 
Section 19.6.2). 


Figure 19.14 Various models for sequential data. (a) A generative directed HMM. (b) A discriminative 
directed MEMM. (c) A discriminative undirected CRF. 


Traditionally, HMMs (discussed in detail in Chapter 17) have been used for such tasks. These 
are joint density models of the form 


$$p(\mathbf{x}, \mathbf{y}|\mathbf{w}) = \prod_{t=1}^T p(y_t|y_{t-1}, \mathbf{w})\, p(\mathbf{x}_t|y_t, \mathbf{w}) \tag{19.65}$$


where we have dropped the initial $p(y_1)$ term for simplicity. See Figure 19.14(a). If we observe 
both $\mathbf{x}_t$ and $y_t$ for all $t$, it is very easy to train such models, using techniques described in 
Section 17.5.1. 

An HMM requires specifying a generative observation model, $p(\mathbf{x}_t|y_t, \mathbf{w})$, which can be 
difficult. Furthermore, each $\mathbf{x}_t$ is required to be local, since it is hard to define a generative 
model for the whole stream of observations, $\mathbf{x} = \mathbf{x}_{1:T}$. 

An obvious way to make a discriminative version of an HMM is to “reverse the arrows” from 
$y_t$ to $\mathbf{x}_t$, as in Figure 19.14(b). This defines a directed discriminative model of the form 

$$p(\mathbf{y}|\mathbf{x}, \mathbf{w}) = \prod_t p(y_t|y_{t-1}, \mathbf{x}, \mathbf{w}) \tag{19.66}$$

where $\mathbf{x} = (\mathbf{x}_{1:T}, \mathbf{x}_g)$, $\mathbf{x}_g$ are global features, and $\mathbf{x}_t$ are features specific to node $t$. (This 
partition into local and global is not necessary, but helps when comparing to HMMs.) This is 
called a maximum entropy Markov model or MEMM (McCallum et al. 2000; Kakade et al. 
2002). 

An MEMM is simply a Markov chain in which the state transition probabilities are conditioned 
on the input features. (It is therefore a special case of an input-output HMM, discussed in 
Section 17.6.3.) This seems like the natural generalization of logistic regression to the structured- 
output setting, but it suffers from a subtle problem known (rather obscurely) as the label bias 
problem (Lafferty et al. 2001). The problem is that local features at time $t$ do not influence states 
prior to time $t$. This follows by examining the DAG, which shows that $\mathbf{x}_t$ is d-separated from 
$y_{t-1}$ (and all earlier time points) by the v-structure at $y_t$, which is a hidden child, thus blocking 
the information flow. 

To understand what this means in practice, consider the part of speech (POS) tagging task. 
Suppose we see the word “banks”; this could be a verb (as in “he banks at BoA”), or a noun (as 
in “the river banks were overflowing”). Locally the POS tag for the word is ambiguous. However, 
suppose that later in the sentence, we see the word “fishing”; this gives us enough context to 
infer that the sense of “banks” is “river banks”. However, in an MEMM (unlike in an HMM and 
CRF), the “fishing” evidence will not flow backwards, so we will not be able to disambiguate 
“banks”. 


Figure 19.15 Example of handwritten letter recognition. In the word “brace”, the “r” and the “c” look very 
similar, but can be disambiguated using context. Source: (Taskar et al. 2003). Used with kind permission 
of Ben Taskar. 


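Now consider a chain-structured CRF. This model has the form 

$$p(\mathbf{y}|\mathbf{x}, \mathbf{w}) = \frac{1}{Z(\mathbf{x}, \mathbf{w})} \prod_{t=1}^{T} \psi(y_t|\mathbf{x}, \mathbf{w}) \prod_{t=1}^{T-1} \psi(y_t, y_{t+1}|\mathbf{x}, \mathbf{w}) \tag{19.67}$$

From the graph in Figure 19.14(c), we see that the label bias problem no longer exists, since $y_t$ 
does not block the information from $\mathbf{x}_t$ from reaching other $y_{t'}$ nodes. 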
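The difference between the two models can be checked numerically on a two-step chain. The sketch below is an illustration under stated assumptions, not the construction used in the references: labels and observations are binary, each step sees only its own local observation, and the weight tables `node` and `edge` are made-up numbers. It computes the marginal of the first label under a locally normalized MEMM and under a globally normalized CRF that share the same scores.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Hypothetical weights, chosen only for illustration.
node = np.array([[0.2, 0.0],    # node[x, y]: score of label y given local observation x
                 [0.0, 1.5]])
edge = np.array([[1.0, -1.0],   # edge[y1, y2]: score of the transition y1 -> y2
                 [-1.0, 1.0]])

def memm_marginal_y1(x1, x2):
    """p(y1 | x1, x2) when every factor is a locally normalized CPD."""
    p_y1 = softmax(node[x1])
    joint = np.zeros((2, 2))
    for y1 in range(2):
        joint[y1] = p_y1[y1] * softmax(edge[y1] + node[x2])
    return joint.sum(axis=1)

def crf_marginal_y1(x1, x2):
    """p(y1 | x1, x2) when the same scores are normalized once, globally."""
    scores = node[x1][:, None] + edge + node[x2][None, :]
    joint = np.exp(scores)
    joint /= joint.sum()
    return joint.sum(axis=1)

print(memm_marginal_y1(0, 0), memm_marginal_y1(0, 1))  # identical
print(crf_marginal_y1(0, 0), crf_marginal_y1(0, 1))    # different
```

In the MEMM, changing the second observation leaves the marginal of the first label untouched, because each conditional distribution sums to one over the later label. In the CRF, the later evidence changes the earlier marginal.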

The label bias problem in MEMMs occurs because directed models are locally normalized, 
meaning each CPD sums to 1. By contrast, MRFs and CRFs are globally normalized, which 
means that local factors do not need to sum to 1, since the partition function Z, which sums over 
all joint configurations, will ensure the model defines a valid distribution. However, this solution 
comes at a price: we do not get a valid probability distribution over y until we have seen 
the whole sentence, since only then can we normalize over all configurations. Consequently, 
CRFs are not as useful as DGMs (whether discriminative or generative) for online or real-time 
inference. Furthermore, the fact that Z depends on all the nodes, and hence all their parameters, 
makes CRFs much slower to train than DGMs, as we will see in Section 19.6.3. 
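For a chain, the sum over all joint configurations that defines Z can be computed by a forward recursion instead of by enumeration. The following is a minimal sketch, assuming log-potentials are supplied as arrays `node_scores` (T x K) and `edge_scores` (K x K, shared across positions); these names and the homogeneous-edge assumption are illustrative choices, not part of the text. A brute-force version is included only to check the recursion on a small example.

```python
import itertools
import numpy as np

def logsumexp(a, axis=None):
    """Numerically stable log(sum(exp(a))) along `axis`."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def log_partition_chain(node_scores, edge_scores):
    """log Z for a chain CRF via the forward recursion, O(T K^2) time."""
    alpha = node_scores[0]
    for t in range(1, len(node_scores)):
        # alpha_new[k] = log sum_j exp(alpha[j] + edge[j, k]) + node[t, k]
        alpha = logsumexp(alpha[:, None] + edge_scores, axis=0) + node_scores[t]
    return logsumexp(alpha)

def log_partition_brute(node_scores, edge_scores):
    """log Z by enumerating all K^T labelings (for checking only)."""
    T, K = node_scores.shape
    totals = []
    for y in itertools.product(range(K), repeat=T):
        s = sum(node_scores[t, y[t]] for t in range(T))
        s += sum(edge_scores[y[t], y[t + 1]] for t in range(T - 1))
        totals.append(s)
    return logsumexp(np.array(totals))

rng = np.random.default_rng(0)
node_scores = rng.normal(size=(5, 3))
edge_scores = rng.normal(size=(3, 3))
print(log_partition_chain(node_scores, edge_scores),
      log_partition_brute(node_scores, edge_scores))
```

Note that the recursion needs all T node potentials before it can return Z, which is the reason given above for why CRFs are less convenient for online inference.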


Applications of CRFs 


CRFs have been applied to many interesting problems; we give a representative sample below. 
These applications illustrate several useful modeling tricks, and will also provide motivation for 
some of the inference techniques we will discuss in Chapter 20. 


Handwriting recognition 


A natural application of CRFs is to classify strings of hand-written characters, as illustrated in Figure 19.15. 
The key observation is that locally a letter may be ambiguous, but by depending on the 
(unknown) labels of one’s neighbors, it is possible to use context to reduce the error rate. Note 
that the node potential, $\psi_t(y_t|\mathbf{x}_t)$, is often taken to be a probabilistic discriminative classifier, 
such as a neural network or RVM, that is trained on isolated letters, and the edge potentials, 
$\psi_{st}(y_s, y_t)$, are often taken to be a language bigram model. Later we will discuss how to train 
all the potentials jointly. 


Figure 19.16 A CRF for joint POS tagging and NP segmentation, shown for the sentence “British Airways 
rose after announcing its withdrawal from the UAL deal”. Key: B = begin noun phrase; I = within noun 
phrase; O = not a noun phrase; N = noun; ADJ = adjective; V = verb; IN = preposition; PRP = possessive 
pronoun; DT = determiner (e.g., a, an, the). Source: Figure 4.E.1 of (Koller and 
Friedman 2009). Used with kind permission of Daphne Koller. 


Noun phrase chunking 


One common NLP task is noun phrase chunking, which refers to the task of segmenting a 
sentence into its distinct noun phrases (NPs). This is a simple example of a technique known as 
shallow parsing. 

In more detail, we tag each word in the sentence with B (meaning beginning of a new NP), I 
(meaning inside an NP), or O (meaning outside an NP). This is called BIO notation. For example, 
in the following sentence, the NPs are marked with brackets: 


 B       I        O    O     O           B   I           O     B   I   I
(British Airways) rose after announcing (its withdrawal) from (the UAL deal) 


(We need the B symbol so that we can distinguish I I, meaning two words within a single NP, 
from B B, meaning two separate NPs.) 
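The mapping from a BIO tag sequence back to noun-phrase chunks is mechanical. The following is a minimal sketch; the function name `bio_to_chunks` is hypothetical, and treating a stray I with no open chunk as O is an assumption made here for robustness.

```python
def bio_to_chunks(words, tags):
    """Group words into chunks according to their B/I/O tags."""
    chunks, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B":                      # start a new chunk
            if current:
                chunks.append(" ".join(current))
            current = [word]
        elif tag == "I" and current:        # extend the open chunk
            current.append(word)
        else:                               # "O", or a stray "I" with no open chunk
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

words = "British Airways rose after announcing its withdrawal from the UAL deal".split()
tags = "B I O O O B I O B I I".split()
print(bio_to_chunks(words, tags))
```

The same function shows why the B symbol is needed: the tags B B yield two chunks, whereas B I yields one.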

A standard approach to this problem would first convert the string of words into a string of 
POS tags, and then convert the POS tags to a string of BIOs. However, such a pipeline method 
can propagate errors. A more robust approach is to build a joint probabilistic model of the 
form $p(\text{NP}_{1:T}, \text{POS}_{1:T}|\text{words}_{1:T})$. One way to do this is to use the CRF in Figure 19.16. The 
connections between adjacent labels encode the probability of transitioning between the B, I 
and O states, and can enforce constraints such as the fact that B must precede I. The features 
are usually hand engineered and include things like: does this word begin with a capital letter, is 
this word followed by a full stop, is this word a noun, etc. Typically there are approximately 1,000 to 10,000 
features per node. 

The number of features has minimal impact on the inference time, since the features are 
observed and do not need to be summed over. (There is a small increase in the cost of 
evaluating potential functions with many features, but this is usually negligible; if not, one can 
use $\ell_1$ regularization to prune out irrelevant features.) However, the graph structure can have a 
dramatic effect on inference time. The model in Figure 19.16 is tractable, since it is essentially a 
“fat chain”, so we can use the forwards-backwards algorithm (Section 17.4.3) for exact inference 
in $O(T|\text{POS}|^2|\text{NP}|^2)$ time, where $|\text{POS}|$ is the number of POS tags, and $|\text{NP}|$ is the number 
of NP tags. However, the seemingly similar graph in Figure 19.17, to be explained below, is 
computationally intractable. 


Figure 19.17 A skip-chain CRF for named entity recognition, shown for the sentences “Mrs. Green spoke 
today in New York” and “Green chairs the finance committee”. Key: B-PER = begin person name; I-PER = 
within person name; B-LOC = begin location name; I-LOC = within location name; OTH = not an entity. 
Source: Figure 4.E.1 of (Koller and Friedman 
2009). Used with kind permission of Daphne Koller. 


Named entity recognition 


A task that is related to NP chunking is named entity extraction. Instead of just segmenting 
out noun phrases, we can segment out phrases to do with people and locations. Similar 
techniques are used to automatically populate your calendar from your email messages; this is 
called information extraction. 

A simple approach to this is to use a chain-structured CRF, but to expand the state space 
from BIO to B-Per, I-Per, B-Loc, I-Loc, and Other. However, sometimes it is ambiguous whether 
a word is a person, location, or something else. (Proper nouns are particularly difficult to deal 
with because they belong to an open class, that is, there is an unbounded number of possible 
names, unlike the set of nouns and verbs, which is large but essentially fixed.) We can get better 
performance by considering long-range correlations between words. For example, we might add 
a link between all occurrences of the same word, and force the word to have the same tag in 
each occurrence. (The same technique can also be helpful for resolving the identity of pronouns.) 
This is known as a skip-chain CRF. See Figure 19.17 for an illustration. 

We see that the graph structure itself changes depending on the input, which is an additional 
advantage of CRFs over generative models. Unfortunately, inference in this model is generally 
more expensive than in a simple chain with local connections, for reasons explained in 
Section 20.5. 


Figure 19.18 Illustration of a simple parse tree based on a context free grammar in Chomsky normal 
form, for the sentence “The dog chased the cat”. The feature vector $\boldsymbol{\phi}(\mathbf{x}, \mathbf{y}) = \Psi(\mathbf{x}, \mathbf{y})$ counts the number of times each production rule was used. 
Source: Figure 5.2 of (Altun et al. 2006). Used with kind permission of Yasemin Altun. 


Natural language parsing 


A generalization of chain-structured models for language is to use probabilistic grammars. In 
particular, a probabilistic context free grammar or PCFG is a set of re-write or production 
rules of the form $\sigma \to \sigma' \sigma''$ or $\sigma \to x$, where $\sigma, \sigma', \sigma'' \in \Sigma$ are non-terminals (analogous to 
parts of speech), and $x \in \mathcal{X}$ are terminals, i.e., words. See Figure 19.18 for an example. Each 
such rule has an associated probability. The resulting model defines a probability distribution 
over sequences of words. We can compute the probability of observing a particular sequence 
$\mathbf{x} = x_1 \ldots x_T$ by summing over all trees that generate it. This can be done in $O(T^3)$ time 
using the inside-outside algorithm; see e.g., (Jurafsky and Martin 2008; Manning and Schuetze 
1999) for details. 

PCFGs are generative models. It is possible to make discriminative versions which encode 
the probability of a labeled tree, $\mathbf{y}$, given a sequence of words, $\mathbf{x}$, by using a CRF of the form 
$p(\mathbf{y}|\mathbf{x}) \propto \exp(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}))$. For example, we might define $\boldsymbol{\phi}(\mathbf{x}, \mathbf{y})$ to count the number of 
times each production rule was used (which is analogous to the number of state transitions in 
a chain-structured model). See e.g., (Taskar et al. 2004) for details. 
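A feature vector that counts production rules can be computed by a single traversal of the parse tree. The sketch below assumes a tree stored as nested tuples and a start symbol S; both the representation and the function name `rule_counts` are hypothetical choices made for illustration, and the sentence is the one shown in Figure 19.18.

```python
from collections import Counter

# A parse tree as nested tuples: (label, child, child) for a binary rule,
# or (label, word) for a terminal rule.
tree = ("S",
        ("NP", ("Det", "the"), ("N", "dog")),
        ("VP", ("V", "chased"),
               ("NP", ("Det", "the"), ("N", "cat"))))

def rule_counts(node, counts=None):
    """Count how often each production rule is used in the tree."""
    counts = Counter() if counts is None else counts
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        counts[(label, children[0])] += 1                       # sigma -> x
    else:
        counts[(label,) + tuple(c[0] for c in children)] += 1   # sigma -> sigma' sigma''
        for child in children:
            rule_counts(child, counts)
    return counts

phi = rule_counts(tree)
print(phi[("NP", "Det", "N")], phi[("Det", "the")], phi[("Det", "dog")])
```

Rules that never fire, such as Det -> dog, simply have count zero, so the resulting vector is sparse; the score of the tree is then the inner product of these counts with one weight per rule.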


Hierarchical classification 


Suppose we are performing multi-class classification, where we have a label taxonomy, which 
groups the classes into a hierarchy. We can encode the position of $y$ within this hierarchy by 
defining a binary vector $\boldsymbol{\phi}(y)$, where we turn on the bit for component $y$ and for all its ancestors. 
This can be combined with input features $\boldsymbol{\phi}(\mathbf{x})$ using a tensor product, $\boldsymbol{\phi}(\mathbf{x}, y) = \boldsymbol{\phi}(\mathbf{x}) \otimes \boldsymbol{\phi}(y)$. 
See Figure 19.19 for an example. 
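The tensor-product construction can be sketched with a Kronecker product. The example below assumes the convention of Figure 19.19, in which the active bits for class 2 are nodes 2, 6 and 9, i.e., the class itself and the nodes above it in the taxonomy. The `parent` map is a hypothetical tree chosen to be consistent with that figure, and the weights are arbitrary numbers. The ordering of the Kronecker factors only changes how the weight vector is laid out.

```python
import numpy as np

# Hypothetical taxonomy: leaves 1-5, internal nodes 6-8, root 9.
parent = {1: 6, 2: 6, 3: 7, 4: 7, 5: 8, 6: 9, 7: 9, 8: 9}
n_nodes = 9

def label_vector(y):
    """Binary vector with a 1 for class y and for every node above it."""
    phi_y = np.zeros(n_nodes)
    node = y
    while node is not None:
        phi_y[node - 1] = 1.0
        node = parent.get(node)        # None once we pass the root
    return phi_y

def joint_features(x, y):
    """Stack one copy of x for each active taxonomy node."""
    return np.kron(label_vector(y), x)

x = np.array([1.0, 2.0])
W = np.arange(n_nodes * 2, dtype=float).reshape(n_nodes, 2)  # row k-1 holds w_k
score = W.ravel() @ joint_features(x, 2)
print(score, (W[1] + W[5] + W[8]) @ x)   # both equal <w_2 + w_6 + w_9, x>
```

Because classes 1 and 2 share the parameters of nodes 6 and 9, evidence for one class also moves the score of its sibling, which is the parameter sharing described in the text.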

This method is widely used for text classification, where manually constructed taxonomies 
(such as the Open Directory Project at www.dmoz.org) are quite common. The benefit is that 
information can be shared between the parameters for nearby categories, enabling generalization 
across classes. 


Figure 19.19 Illustration of a simple label taxonomy, and how it can be used to compute a distributed 
representation for the label for class 2, for which $\langle \mathbf{w}, \Psi(\mathbf{x}, 2) \rangle = \langle \mathbf{w}_2, \mathbf{x} \rangle + \langle \mathbf{w}_6, \mathbf{x} \rangle + \langle \mathbf{w}_9, \mathbf{x} \rangle$. 
In this figure, $\boldsymbol{\phi}(\mathbf{x}) = \mathbf{x}$, $\boldsymbol{\phi}(y = 2) = \Lambda(2)$, $\boldsymbol{\phi}(\mathbf{x}, y)$ is denoted 
by $\Psi(\mathbf{x}, 2)$, and $\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, y)$ is denoted by $\langle \mathbf{w}, \Psi(\mathbf{x}, 2) \rangle$. Source: Figure 5.1 of (Altun et al. 2006). Used 
with kind permission of Yasemin Altun. 


Protein side-chain prediction 


An interesting analog to the skip-chain model arises in the problem of predicting the structure 
of protein side chains. Each residue in the side chain has 4 dihedral angles, which are usually 
discretized into 3 values called rotamers. The goal is to predict this discrete sequence of angles, 
y, from the discrete sequence of amino acids, x. 

We can define an energy function $E(\mathbf{x}, \mathbf{y})$, where we include various pairwise interaction 
terms between nearby residues (elements of the $\mathbf{y}$ vector). This energy is usually defined as a 
weighted sum of individual energy terms, $E(\mathbf{x}, \mathbf{y}|\mathbf{w}) = \sum_{j=1}^{D} w_j E_j(\mathbf{x}, \mathbf{y})$, where the $E_j$ are 
energy contributions due to various electrostatic charges, hydrogen bonding potentials, etc., and 
$\mathbf{w}$ are the parameters of the model. See (Yanover et al. 2007) for details. 

Given the model, we can compute the most probable side chain configuration using $\mathbf{y}^* = 
\arg\min_{\mathbf{y}} E(\mathbf{x}, \mathbf{y}|\mathbf{w})$. In general, this problem is NP-hard, depending on the nature of the graph 
induced by the $E_j$ terms, due to long-range connections between the variables. Nevertheless, 
some special cases can be efficiently handled, using methods discussed in Section 22.6. 


Stereo vision 


Low-level vision problems are problems where the input is an image (or set of images), and 
the output is a processed version of the image. In such cases, it is common to use 2d lattice- 
structured models; the models are similar to Figure 19.9, except that the features can be global, 
and are not generated by the model. We will assume a pairwise CRF. 

A classic low-level vision problem is dense stereo reconstruction, where the goal is to 
estimate the depth of every pixel given two images taken from slightly different angles. In this 
section (based on (Sudderth and Freeman 2008)), we give a sketch of how a simple CRF can be 
used to solve this task. See e.g., (Sun et al. 2003) for a more sophisticated model. 

By using some standard preprocessing techniques, one can convert depth estimation into a 
problem of estimating the disparity $y_s$ between the pixel at location $(i_s, j_s)$ in the left image 
and the corresponding pixel at location $(i_s + y_s, j_s)$ in the right image. We typically assume 
that corresponding pixels have similar intensity, so we define a local node potential of the form 

$$\psi_s(y_s|\mathbf{x}) \propto \exp\left\{ -\frac{1}{2\sigma^2} \left( x_L(i_s, j_s) - x_R(i_s + y_s, j_s) \right)^2 \right\} \tag{19.68}$$

where $x_L$ is the left image and $x_R$ is the right image. This equation can be generalized to model 
the intensity of small windows around each location. In highly textured regions, it is usually 
possible to find the corresponding patch using cross correlation, but in regions of low texture, 
there will be considerable ambiguity about the correct value of $y_s$. 

We can easily add a Gaussian prior on the edges of the MRF that encodes the assumption 
that neighboring disparities $y_s, y_t$ should be similar, as follows: 

$$\psi_{st}(y_s, y_t) \propto \exp\left( -\frac{1}{2\gamma^2} (y_s - y_t)^2 \right) \tag{19.69}$$

The resulting model is a Gaussian CRF. 
However, using Gaussian edge-potentials will oversmooth the estimate, since this prior fails 
to account for the occasional large changes in disparity that occur between neighboring pixels 
which are on different sides of an occlusion boundary. One gets much better results using a 
truncated Gaussian potential of the form 

$$\psi_{st}(y_s, y_t) \propto \exp\left\{ -\frac{1}{2\gamma^2} \min\left( (y_s - y_t)^2, \delta_0^2 \right) \right\} \tag{19.70}$$

where $\gamma$ encodes the expected smoothness, and $\delta_0$ encodes the maximum penalty that will 
be imposed if disparities are significantly different. This is called a discontinuity preserving 
potential; note that such penalties are not convex. The local evidence potential can be made 
robust in a similar way, in order to handle outliers due to specularities, occlusions, etc. 
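The effect of the truncation is easiest to see by comparing the two penalties (negative log potentials) as the disparity difference grows. The following is a minimal sketch of Equations 19.69 and 19.70, up to additive constants; the function names and the parameter values are arbitrary choices made for illustration.

```python
import numpy as np

def gaussian_energy(ys, yt, gamma):
    """Negative log of the Gaussian edge potential (Equation 19.69)."""
    return (ys - yt) ** 2 / (2 * gamma ** 2)

def truncated_energy(ys, yt, gamma, delta0):
    """Negative log of the truncated potential (Equation 19.70)."""
    return np.minimum((ys - yt) ** 2, delta0 ** 2) / (2 * gamma ** 2)

diffs = np.array([0.0, 1.0, 2.0, 5.0, 20.0])
g = gaussian_energy(diffs, 0.0, gamma=1.0)
t = truncated_energy(diffs, 0.0, gamma=1.0, delta0=2.0)
print(g)   # grows quadratically with the disparity difference
print(t)   # saturates once the difference exceeds delta0
```

With these settings a jump of 20 costs 200 under the Gaussian prior but only 2 under the truncated one, so a sharp depth discontinuity is no more expensive than a moderate one. The saturation is also what makes the penalty non-convex.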

Figure 19.20 illustrates the difference between these two forms of prior. On the top left is an 
image from the standard Middlebury stereo benchmark dataset (Scharstein and Szeliski 2002). 
On the bottom left are the corresponding true disparity values. The remaining columns represent 
the estimated disparity after 0, 1 and an “infinite” number of rounds of loopy belief propagation 
(see Section 22.2), where by “infinite” we mean the results at convergence. The top row shows 
the results using a Gaussian edge potential, and the bottom row shows the results using the 
truncated potential. The latter is clearly better. 

Unfortunately, performing inference with real-valued variables is computationally difficult, 
unless the model is jointly Gaussian. Consequently, it is common to discretize the variables. 
(For example, Figure 19.20(bottom) used 50 states.) The edge potentials still have the form given 
in Equation 19.69. The resulting model is called a metric CRF, since the potentials form a 
metric.⁹ Inference in metric CRFs is more efficient than in CRFs where the discrete labels 
have no natural ordering, as we explain in Section 22.6.3.3. See Section 22.6.4 for a comparison 
of various approximate inference methods applied to low-level CRFs, and see (Blake et al. 2011; 
Prince 2012) for more details on probabilistic models for computer vision. 


9. A function $f$ is said to be a metric if it satisfies the following three properties: Reflexivity: $f(a,b) = 0$ iff $a = b$; 
Symmetry: $f(a,b) = f(b,a)$; and Triangle inequality: $f(a,b) + f(b,c) \ge f(a,c)$. If $f$ satisfies only the first two 
properties, it is called a semi-metric. 


Figure 19.20 Illustration of belief propagation for stereo depth estimation. Left column: image and true 
disparities. Remaining columns: initial estimate, estimate after 1 iteration, and estimate at convergence. 
Top row: Gaussian edge potentials. Bottom row: robust edge potentials. Source: Figure 4 of (Sudderth and 
Freeman 2008). Used with kind permission of Erik Sudderth. 


CRF training 


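We can adapt the gradient-based optimization of MRFs described in Section 19.5.1 to the CRF 
case in a straightforward way. In particular, the scaled log-likelihood becomes 

$$\ell(\mathbf{w}) \triangleq \frac{1}{N} \sum_i \log p(\mathbf{y}_i|\mathbf{x}_i, \mathbf{w}) = \frac{1}{N} \sum_i \left[ \sum_c \mathbf{w}_c^T \boldsymbol{\phi}_c(\mathbf{y}_i, \mathbf{x}_i) - \log Z(\mathbf{w}, \mathbf{x}_i) \right] \tag{19.71}$$

and the gradient becomes 

$$\frac{\partial \ell}{\partial \mathbf{w}_c} = \frac{1}{N} \sum_i \left[ \boldsymbol{\phi}_c(\mathbf{y}_i, \mathbf{x}_i) - \frac{\partial}{\partial \mathbf{w}_c} \log Z(\mathbf{w}, \mathbf{x}_i) \right] \tag{19.72}$$

$$= \frac{1}{N} \sum_i \left[ \boldsymbol{\phi}_c(\mathbf{y}_i, \mathbf{x}_i) - \mathbb{E}\left[ \boldsymbol{\phi}_c(\mathbf{y}, \mathbf{x}_i) \right] \right] \tag{19.73}$$

Note that we now have to perform inference for every single training case inside each gradient 
step, which is O(N) times slower than the MRF case. This is because the partition function 
depends on the inputs $\mathbf{x}_i$. 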
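The "observed minus expected features" form of the gradient can be checked on a model small enough to enumerate. The sketch below is an illustration only: it uses a 3-node binary chain, a made-up feature design (per-label sums of the inputs plus transition counts), and brute-force enumeration in place of a proper inference routine.

```python
import itertools
import numpy as np

K, T = 2, 3                                   # labels per node, chain length (toy sizes)
LABELS = list(itertools.product(range(K), repeat=T))

def features(y, x):
    """Summed node and edge features phi(y, x) for a chain (hypothetical design)."""
    node = np.zeros((K, x.shape[1]))
    edge = np.zeros((K, K))
    for t in range(T):
        node[y[t]] += x[t]
        if t + 1 != T:
            edge[y[t], y[t + 1]] += 1.0
    return np.concatenate([node.ravel(), edge.ravel()])

def log_lik_and_grad(w, data):
    """Scaled conditional log-likelihood and its gradient (Equations 19.71-19.73)."""
    ll, grad = 0.0, np.zeros_like(w)
    for x, y in data:
        feats = np.array([features(yy, x) for yy in LABELS])
        scores = feats @ w
        log_z = scores.max() + np.log(np.sum(np.exp(scores - scores.max())))
        probs = np.exp(scores - log_z)        # p(y | x_i, w) for every labeling
        phi_obs = features(y, x)
        ll += phi_obs @ w - log_z
        grad += phi_obs - probs @ feats       # observed minus expected features
    return ll / len(data), grad / len(data)

rng = np.random.default_rng(0)
data = [(rng.normal(size=(T, 2)), tuple(rng.integers(0, K, size=T))) for _ in range(4)]
w = rng.normal(size=K * 2 + K * K)
ll, grad = log_lik_and_grad(w, data)
print(ll, grad)
```

The expectation is recomputed for every training case because `log_z` and `probs` depend on that case's input, which is the source of the extra cost relative to the MRF case.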

In most applications of CRFs (and some applications of MRFs), the size of the graph structure 
can vary. Hence we need to use parameter tying to ensure we can define a distribution of 
arbitrary size. In the pairwise case, we can write the model as follows: 


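$$p(\mathbf{y}|\mathbf{x}, \mathbf{w}) = \frac{1}{Z(\mathbf{w}, \mathbf{x})} \exp\left( \mathbf{w}^T \boldsymbol{\phi}(\mathbf{y}, \mathbf{x}) \right) \tag{19.74}$$

where $\mathbf{w} = [\mathbf{w}_n, \mathbf{w}_e]$ are the node and edge parameters, and 

$$\boldsymbol{\phi}(\mathbf{y}, \mathbf{x}) \triangleq \left[ \sum_t \boldsymbol{\phi}_t(y_t, \mathbf{x}), \; \sum_{s \sim t} \boldsymbol{\phi}_{st}(y_s, y_t, \mathbf{x}) \right] \tag{19.75}$$

are the summed node and edge features (these are the sufficient statistics). The gradient 
expression is easily modified to handle this case. 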
In practice, it is important to use a prior/regularization to prevent overfitting. If we use a 
Gaussian prior, the new objective becomes 

$$\ell'(\mathbf{w}) \triangleq \frac{1}{N} \sum_i \log p(\mathbf{y}_i|\mathbf{x}_i, \mathbf{w}) - \lambda ||\mathbf{w}||_2^2 \tag{19.76}$$

It is simple to modify the gradient expression. 
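For concreteness, the modification can be written out as follows. This is a short worked step rather than a formula quoted from the text: differentiating the penalty $\lambda ||\mathbf{w}||_2^2$ with respect to the block $\mathbf{w}_c$ gives $2\lambda \mathbf{w}_c$, which is subtracted from the gradient in Equation 19.73.

```latex
\frac{\partial \ell'}{\partial \mathbf{w}_c}
  = \frac{1}{N} \sum_i \left[ \boldsymbol{\phi}_c(\mathbf{y}_i, \mathbf{x}_i)
      - \mathbb{E}\left[ \boldsymbol{\phi}_c(\mathbf{y}, \mathbf{x}_i) \right] \right]
    - 2 \lambda \mathbf{w}_c
```

The extra term shrinks each weight towards zero in proportion to its current value, and leaves the inference step unchanged.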

Alternatively, we can use $\ell_1$ regularization. For example, we could use $\ell_1$ for the edge weights 
$\mathbf{w}_e$ to learn a sparse graph structure, and $\ell_2$ for the node weights $\mathbf{w}_n$, as in (Schmidt et al. 
2008). In other words, the objective becomes 

$$\ell'(\mathbf{w}) \triangleq \frac{1}{N} \sum_i \log p(\mathbf{y}_i|\mathbf{x}_i, \mathbf{w}) - \lambda_1 ||\mathbf{w}_e||_1 - \lambda_2 ||\mathbf{w}_n||_2^2 \tag{19.77}$$

Unfortunately, the optimization algorithms are more complicated when we use $\ell_1$ (see 
Section 13.4), although the problem is still convex. 

To handle large datasets, we can use stochastic gradient descent (SGD), as described in 
Section 8.5.2. 

It is possible (and useful) to define CRFs with hidden variables, for example to allow for an 
unknown alignment between the visible features and the hidden labels (see e.g., (Schnitzspan 
et al. 2010)). In this case, the objective function is no longer convex. Nevertheless, we can find 
a locally optimal ML or MAP parameter estimate using EM and/or gradient methods. 


Structural SVMs 


We have seen that training a CRF requires inference, in order to compute the expected sufficient 
statistics needed to evaluate the gradient. For certain models, computing a joint MAP estimate 
of the states is provably simpler than computing marginals, as we discuss in Section 22.6. In this 
section, we discuss a way to train structured output classifiers that leverages the existence of 
fast MAP solvers. (To avoid confusion with MAP estimation of parameters, we will often refer to 
MAP estimation of states as decoding.) These methods are known as structural support vector 
machines or SSVMs (Tsochantaridis et al. 2005). (There is also a very similar class of methods 
known as max margin Markov networks or M3nets (Taskar et al. 2003); see Section 19.7.2 for 
a discussion of the differences.) 


SSVMs: a probabilistic view 


In this book, we have mostly concentrated on fitting models using MAP parameter estimation, 
i.e., by minimizing functions of the form 


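$$R_{MAP}(\mathbf{w}) = -\log p(\mathbf{w}) - \sum_{i=1}^{N} \log p(\mathbf{y}_i|\mathbf{x}_i, \mathbf{w}) \tag{19.78}$$

However, at test time, we pick the label so as to minimize the posterior expected loss (defined 
in Section 5.7): 

$$\hat{\mathbf{y}}(\mathbf{x}|\mathbf{w}) = \arg\min_{\hat{\mathbf{y}}} \sum_{\mathbf{y}} L(\mathbf{y}, \hat{\mathbf{y}})\, p(\mathbf{y}|\mathbf{x}, \mathbf{w}) \tag{19.79}$$

where $L(\mathbf{y}^*, \hat{\mathbf{y}})$ is the loss we incur when we estimate $\hat{\mathbf{y}}$ but the truth is $\mathbf{y}^*$. It therefore seems 
reasonable to take the loss function into account when performing parameter estimation.¹⁰ So, 
following (Yuille and He 2011), let us instead minimize the posterior expected loss on the 
training set: 

$$R_{EL}(\mathbf{w}) \triangleq -\log p(\mathbf{w}) + \sum_{i=1}^{N} \log\left[ \sum_{\mathbf{y}} L(\mathbf{y}_i, \mathbf{y})\, p(\mathbf{y}|\mathbf{x}_i, \mathbf{w}) \right] \tag{19.80}$$

In the special case of 0-1 loss, $L(\mathbf{y}_i, \mathbf{y}) = 1 - \delta_{\mathbf{y}, \mathbf{y}_i}$, this reduces to $R_{MAP}$. 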
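We will assume that we can write our model in the following form: 

$$p(\mathbf{y}|\mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}))}{Z(\mathbf{x}, \mathbf{w})} \tag{19.81}$$

$$p(\mathbf{w}) = \frac{\exp(-E(\mathbf{w}))}{Z} \tag{19.82}$$

where $Z(\mathbf{x}, \mathbf{w}) = \sum_{\mathbf{y}} \exp(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}))$. Also, let us define $L(\mathbf{y}_i, \mathbf{y}) = \exp \tilde{L}(\mathbf{y}_i, \mathbf{y})$. With 
this, we can rewrite our objective as follows: 

$$R_{EL}(\mathbf{w}) = -\log p(\mathbf{w}) + \sum_i \log\left[ \sum_{\mathbf{y}} \exp \tilde{L}(\mathbf{y}_i, \mathbf{y}) \, \frac{\exp(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}))}{Z(\mathbf{x}_i, \mathbf{w})} \right] \tag{19.83}$$

$$= E(\mathbf{w}) + \sum_i \left[ -\log Z(\mathbf{x}_i, \mathbf{w}) + \log \sum_{\mathbf{y}} \exp\left( \tilde{L}(\mathbf{y}_i, \mathbf{y}) + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}) \right) \right] + \text{const} \tag{19.84}$$

where the constant is the log normalizer of the prior, which does not depend on $\mathbf{w}$. 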

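We will now consider various bounds in order to simplify this objective. First note that for 
any function $f(\mathbf{y})$ we have 

$$\max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{y}) \le \log \sum_{\mathbf{y} \in \mathcal{Y}} \exp[f(\mathbf{y})] \le \log\left[ |\mathcal{Y}| \exp\left( \max_{\mathbf{y}} f(\mathbf{y}) \right) \right] = \log |\mathcal{Y}| + \max_{\mathbf{y}} f(\mathbf{y}) \tag{19.85}$$

For example, suppose $\mathcal{Y} = \{0, 1, 2\}$ and $f(y) = y$. Then we have 

$$2 = \log[\exp(2)] \le \log[\exp(0) + \exp(1) + \exp(2)] \le \log[3 \times \exp(2)] = \log(3) + 2 \tag{19.86}$$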
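The sandwich bound on log-sum-exp is easy to verify numerically. The following is a minimal sketch for the example f(y) = y on {0, 1, 2}; the helper name `lse_bounds` is hypothetical.

```python
import numpy as np

def lse_bounds(f):
    """Return (max f, log-sum-exp of f, log|Y| + max f), as in Equation 19.85."""
    f = np.asarray(f, dtype=float)
    lower = f.max()
    lse = lower + np.log(np.sum(np.exp(f - lower)))   # stable log-sum-exp
    upper = np.log(f.size) + lower
    return lower, lse, upper

lower, lse, upper = lse_bounds([0, 1, 2])
print(lower, lse, upper)

# Scaling f up makes the maximum dominate, so the lower bound becomes tight.
lower10, lse10, upper10 = lse_bounds([0, 10, 20])
print(lse - lower, lse10 - lower10)
```

The gap between the log-sum-exp and the maximum shrinks as the differences between the values of f grow, which is the regime in which replacing one by the other is a good approximation.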


10. Note that this violates the fundamental Bayesian distinction between inference and decision making. However, 
performing these tasks separately will only result in an optimal decision if we can compute the exact posterior. In most 
cases, this is intractable, so we need to perform loss-calibrated inference (Lacoste-Julien et al. 2011). In this section, 
we just perform loss-calibrated MAP parameter estimation, which is computationally simpler. (See also (Stoyanov et al. 
2011).) 

We can ignore the $\log |\mathcal{Y}|$ term, which is independent of $\mathbf{y}$, and treat $\max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{y})$ as both a 
lower and upper bound. Hence we see that 

$$R_{EL}(\mathbf{w}) \sim E(\mathbf{w}) + \sum_{i=1}^{N} \left[ \max_{\mathbf{y}} \left\{ \tilde{L}(\mathbf{y}_i, \mathbf{y}) + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}) \right\} - \max_{\mathbf{y}} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}) \right] \tag{19.87}$$

where $x \sim y$ means $c_1 + x \le y + c_2$ for some constants $c_1, c_2$. Unfortunately, this objective 
is not convex in $\mathbf{w}$. However, we can devise a convex upper bound by exploiting the following 
looser lower bound on the log-sum-exp function: 

$$f(\mathbf{y}') \le \log \sum_{\mathbf{y}} \exp[f(\mathbf{y})] \tag{19.88}$$

for any $\mathbf{y}' \in \mathcal{Y}$. Applying this equation to our earlier example, for $f(y) = y$ and $y' = 1$, we get 
$1 = \log[\exp(1)] \le \log[\exp(0) + \exp(1) + \exp(2)]$. And applying this bound to $R_{EL}$ we get 


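$$R_{EL}(\mathbf{w}) \le E(\mathbf{w}) + \sum_{i=1}^{N} \left[ \max_{\mathbf{y}} \left\{ \tilde{L}(\mathbf{y}_i, \mathbf{y}) + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}) \right\} - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i) \right] \tag{19.89}$$

If we set $E(\mathbf{w}) = \frac{1}{2C} ||\mathbf{w}||_2^2$ (corresponding to a spherical Gaussian prior) and multiply the bound by $C$, we get 

$$R_{SSVM}(\mathbf{w}) \triangleq \frac{1}{2} ||\mathbf{w}||^2 + C \sum_{i=1}^{N} \left[ \max_{\mathbf{y}} \left\{ \tilde{L}(\mathbf{y}_i, \mathbf{y}) + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}) \right\} - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i) \right] \tag{19.90}$$

This is the same objective as used in the SSVM approach of (Tsochantaridis et al. 2005). 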
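In the special case that $\mathcal{Y} = \{-1, +1\}$, $\tilde{L}(y^*, y) = 1 - \delta_{y, y^*}$, and $\boldsymbol{\phi}(\mathbf{x}, y) = \frac{1}{2} y \mathbf{x}$, this 
criterion reduces to the following (by considering the two cases that $y = y_i$ and $y \ne y_i$): 

$$R_{SVM}(\mathbf{w}) \triangleq \frac{1}{2} ||\mathbf{w}||^2 + C \sum_{i=1}^{N} \left[ \max\{0, 1 - y_i \mathbf{w}^T \mathbf{x}_i\} \right] \tag{19.91}$$

which is the standard binary SVM objective (see Equation 14.57). 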
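The reduction to the binary hinge loss can be confirmed by evaluating both expressions. The sketch below assumes the feature map phi(x, y) = y x / 2; the factor of one half is what makes the two-case argument produce the margin 1 - y_i w^T x_i exactly. The function names are hypothetical.

```python
import numpy as np

def structured_hinge(w, x, y_true, labels=(-1, +1)):
    """Per-example term of the SSVM objective with 0-1 loss and phi(x, y) = y x / 2."""
    phi = lambda x, y: 0.5 * y * x
    loss = lambda y_true, y: float(y != y_true)
    augmented = max(loss(y_true, y) + w @ phi(x, y) for y in labels)
    return augmented - w @ phi(x, y_true)

def binary_hinge(w, x, y_true):
    """Per-example term of the standard binary SVM objective (Equation 19.91)."""
    return max(0.0, 1.0 - y_true * (w @ x))

rng = np.random.default_rng(1)
w, x = rng.normal(size=3), rng.normal(size=3)
print(structured_hinge(w, x, +1), binary_hinge(w, x, +1))
```

When y equals the true label the bracketed term is 0, and when it differs the term is 1 - y_i w^T x_i, so the maximum over the two cases is the hinge.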

So we see that the SSVM criterion can be seen as optimizing an upper bound on the Bayesian 
objective, a result first shown in (Yuille and He 2011). This bound will be tight (and hence 
the approximation will be a good one) when ||w|| is large, since in that case, p(y|x, w) will 
concentrate its mass on argmaxy p(y|x,w). Unfortunately, a large ||w|| corresponds to a 
model that is likely to overfit, so it is unlikely that we will be working in this regime (because we 
will tune the strength of the regularizer to avoid this situation). An alternative justification for the 
SVM criterion is that it focuses effort on fitting parameters that affect the decision boundary. 
This is a better use of computational resources than fitting the full distribution, especially when 
the model is wrong. 


SSVMs: a non-probabilistic view 


We now present SSVMs in a more traditional (non-probabilistic) way, following (Tsochantaridis 
et al. 2005). The resulting objective will be the same as the one above. However, this derivation 
will set the stage for the algorithms we discuss below. 

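Let $f(\mathbf{x}; \mathbf{w}) = \arg\max_{\mathbf{y} \in \mathcal{Y}} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y})$ be the prediction function. We can obtain zero loss 
on the training set using this predictor if 

$$\forall i. \; \max_{\mathbf{y} \in \mathcal{Y} \setminus \mathbf{y}_i} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}) \le \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i) \tag{19.92}$$

Each one of these nonlinear inequalities can be equivalently replaced by $|\mathcal{Y}| - 1$ linear 
inequalities, resulting in a total of $N|\mathcal{Y}| - N$ linear constraints of the following form: 

$$\forall i. \forall \mathbf{y} \in \mathcal{Y} \setminus \mathbf{y}_i. \; \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i) - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}) \ge 0 \tag{19.93}$$

For brevity, we introduce the notation 

$$\boldsymbol{\delta}_i(\mathbf{y}) \triangleq \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i) - \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}) \tag{19.94}$$

so we can rewrite these constraints as $\mathbf{w}^T \boldsymbol{\delta}_i(\mathbf{y}) \ge 0$. 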
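If we can achieve zero loss, there will typically be multiple solution vectors $\mathbf{w}$. We pick the 
one that maximizes the margin, defined as 

$$\gamma \triangleq \min_i \left[ f(\mathbf{x}_i, \mathbf{y}_i; \mathbf{w}) - \max_{\mathbf{y}' \in \mathcal{Y} \setminus \mathbf{y}_i} f(\mathbf{x}_i, \mathbf{y}'; \mathbf{w}) \right] \tag{19.95}$$

where, with a slight abuse of notation, $f(\mathbf{x}, \mathbf{y}; \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y})$ denotes the score of labeling $\mathbf{y}$. 
Since the margin can be made arbitrarily large by rescaling $\mathbf{w}$, we fix its norm to be 1, resulting 
in the optimization problem 

$$\max_{\gamma, \mathbf{w}: ||\mathbf{w}|| = 1} \gamma \quad \text{s.t.} \quad \forall i. \forall \mathbf{y} \in \mathcal{Y} \setminus \mathbf{y}_i. \; \mathbf{w}^T \boldsymbol{\delta}_i(\mathbf{y}) \ge \gamma \tag{19.96}$$

Equivalently, we can write 

$$\min_{\mathbf{w}} \frac{1}{2} ||\mathbf{w}||^2 \quad \text{s.t.} \quad \forall i. \forall \mathbf{y} \in \mathcal{Y} \setminus \mathbf{y}_i. \; \mathbf{w}^T \boldsymbol{\delta}_i(\mathbf{y}) \ge 1 \tag{19.97}$$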


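To allow for the case where zero loss cannot be achieved (equivalent to the data being inseparable 
in the case of binary classification), we relax the constraints by introducing slack terms $\xi_i$, one 
per data case. This yields 

$$\min_{\mathbf{w}, \boldsymbol{\xi}} \frac{1}{2} ||\mathbf{w}||^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \forall i. \forall \mathbf{y} \in \mathcal{Y} \setminus \mathbf{y}_i. \; \mathbf{w}^T \boldsymbol{\delta}_i(\mathbf{y}) \ge 1 - \xi_i, \; \xi_i \ge 0 \tag{19.98}$$

In the case of structured outputs, we don’t want to treat all constraint violations equally. For 
example, in a segmentation problem, getting one position wrong should be punished less than 
getting many positions wrong. One way to achieve this is to divide the slack variable by the size 
of the loss (this is called slack re-scaling). This yields 

$$\min_{\mathbf{w}, \boldsymbol{\xi}} \frac{1}{2} ||\mathbf{w}||^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \forall i. \forall \mathbf{y} \in \mathcal{Y} \setminus \mathbf{y}_i. \; \mathbf{w}^T \boldsymbol{\delta}_i(\mathbf{y}) \ge 1 - \frac{\xi_i}{L(\mathbf{y}_i, \mathbf{y})}, \; \xi_i \ge 0 \tag{19.99}$$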


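Alternatively, we can define the margin to be proportional to the loss (this is called margin 
re-scaling). This yields 

$$\min_{\mathbf{w}, \boldsymbol{\xi}} \frac{1}{2} ||\mathbf{w}||^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \forall i. \forall \mathbf{y} \in \mathcal{Y} \setminus \mathbf{y}_i. \; \mathbf{w}^T \boldsymbol{\delta}_i(\mathbf{y}) \ge L(\mathbf{y}_i, \mathbf{y}) - \xi_i, \; \xi_i \ge 0 \tag{19.100}$$

(In fact, we can write $\forall \mathbf{y} \in \mathcal{Y}$ instead of $\forall \mathbf{y} \in \mathcal{Y} \setminus \mathbf{y}_i$, since if $\mathbf{y} = \mathbf{y}_i$, then $\mathbf{w}^T \boldsymbol{\delta}_i(\mathbf{y}) = 0$ and 
$L(\mathbf{y}_i, \mathbf{y}) = 0$, so the constraint reduces to $\xi_i \ge 0$, which is already imposed. By using the simpler 
notation, which doesn’t exclude $\mathbf{y}_i$, we add an extra but redundant 
constraint.) This latter approach is used in M3nets. 

For future reference, note that we can solve for the $\xi_i^*$ terms as follows: 

$$\xi_i^*(\mathbf{w}) = \max\left\{ 0, \max_{\mathbf{y}} \left( L(\mathbf{y}_i, \mathbf{y}) - \mathbf{w}^T \boldsymbol{\delta}_i(\mathbf{y}) \right) \right\} = \max_{\mathbf{y}} \left( L(\mathbf{y}_i, \mathbf{y}) - \mathbf{w}^T \boldsymbol{\delta}_i(\mathbf{y}) \right) \tag{19.101}$$

Substituting in, and dropping the constraints, we get the following equivalent problem: 

$$\min_{\mathbf{w}} \frac{1}{2} ||\mathbf{w}||^2 + C \sum_i \left[ \max_{\mathbf{y}} \left\{ L(\mathbf{y}_i, \mathbf{y}) + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}) \right\} - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i) \right] \tag{19.102}$$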
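The per-example term of this objective is a loss-augmented decoding problem followed by a subtraction. The following is a minimal sketch for short label sequences, assuming Hamming loss and a made-up chain feature design; it enumerates all labelings, which is only feasible for toy sizes and stands in for the fast MAP solvers mentioned above. All function names are hypothetical.

```python
import itertools
import numpy as np

def hamming(y_true, y):
    """Number of positions at which the two labelings differ."""
    return float(sum(a != b for a, b in zip(y_true, y)))

def joint_features(x, y, n_labels):
    """Hypothetical chain features: per-label sums of inputs plus transition counts."""
    node = np.zeros((n_labels, x.shape[1]))
    edge = np.zeros((n_labels, n_labels))
    for t, label in enumerate(y):
        node[label] += x[t]
        if t > 0:
            edge[y[t - 1], label] += 1.0
    return np.concatenate([node.ravel(), edge.ravel()])

def structured_hinge(w, x, y_true, n_labels):
    """max_y {L(y_i, y) + w^T phi(x_i, y)} - w^T phi(x_i, y_i), by enumeration."""
    best_val, best_y = -np.inf, None
    for y in itertools.product(range(n_labels), repeat=len(y_true)):
        val = hamming(y_true, y) + w @ joint_features(x, y, n_labels)
        if val > best_val:
            best_val, best_y = val, y
    return best_val - w @ joint_features(x, y_true, n_labels), best_y

def decode(w, x, n_labels):
    """Plain decoding: argmax_y w^T phi(x, y), by enumeration."""
    candidates = itertools.product(range(n_labels), repeat=x.shape[0])
    return max(candidates, key=lambda y: w @ joint_features(x, y, n_labels))

rng = np.random.default_rng(0)
x, y_true = rng.normal(size=(4, 2)), (0, 1, 1, 0)
w = rng.normal(size=2 * 2 + 2 * 2)
hinge, y_aug = structured_hinge(w, x, y_true, n_labels=2)
print(hinge, y_aug, hamming(y_true, decode(w, x, 2)))
```

The hinge value is never negative, since the true labeling is one of the candidates in the maximization, and it is never smaller than the loss of the decoded prediction.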


Empirical risk minimization 


Let us pause and consider whether the above objective is reasonable. Recall that in the 
frequentist approach to machine learning (Section 6.5), the goal is to minimize the regularized empirical 
risk, defined by 

$$R(\mathbf{w}) + \frac{C}{N} \sum_{i=1}^{N} L(\mathbf{y}_i, f(\mathbf{x}_i, \mathbf{w})) \tag{19.103}$$

where $R(\mathbf{w})$ is the regularizer, and $f(\mathbf{x}_i, \mathbf{w}) = \arg\max_{\mathbf{y}} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}) = \hat{\mathbf{y}}_i$ is the prediction. 
Since this objective is hard to optimize, because the loss is not differentiable, we will construct 
a convex upper bound instead. 
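We can show that 

$$R(\mathbf{w}) + \frac{C}{N} \sum_i \max_{\mathbf{y}} \left( L(\mathbf{y}_i, \mathbf{y}) - \mathbf{w}^T \boldsymbol{\delta}_i(\mathbf{y}) \right) \tag{19.104}$$

is such a convex upper bound. To see this, note that 

$$L(\mathbf{y}_i, f(\mathbf{x}_i, \mathbf{w})) \le L(\mathbf{y}_i, f(\mathbf{x}_i, \mathbf{w})) - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i) + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \hat{\mathbf{y}}_i) \tag{19.105}$$

$$\le \max_{\mathbf{y}} \left[ L(\mathbf{y}_i, \mathbf{y}) - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i) + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}) \right] \tag{19.106}$$

The first inequality holds because $\hat{\mathbf{y}}_i$ maximizes $\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y})$, so $\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \hat{\mathbf{y}}_i) \ge \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i)$. 
Using this bound and $R(\mathbf{w}) = \frac{1}{2} ||\mathbf{w}||^2$ yields Equation 19.102 (with $C/N$ in place of $C$). 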


Computational issues 


Although the above objectives are simple quadratic programs (QP), they have $O(N|\mathcal{Y}|)$ 
constraints. This is intractable, since $\mathcal{Y}$ is usually exponentially large. In the case of the margin 
rescaling formulation, it is possible to reduce the exponential number of constraints to a 
polynomial number, provided the loss function and the feature vector decompose according to a 
graphical model. This is the approach used in M3nets (Taskar et al. 2003). 

An alternative approach is to work directly with the exponentially sized QP. This allows for 
the use of more general loss functions. There are several possible methods to make this feasible. 
One is to use cutting plane methods. Another is to use stochastic subgradient methods. We 
discuss both of these below. 


Figure 19.21 Illustration of the cutting plane algorithm in 2d. We start with the estimate $\mathbf{w} = \mathbf{w}_0 = \mathbf{0}$. 
(a) We add the first constraint; the shaded region is the new feasible set. The new minimum norm solution 
is $\mathbf{w}_1$. (b) We add another constraint; the dark shaded region is the new feasible set. (c) We add a third 
constraint. Source: Figure 5.3 of (Altun et al. 2006). Used with kind permission of Yasemin Altun. 


19.7.3 Cutting plane methods for fitting SSVMs 


In this section, we discuss an efficient algorithm for fitting SSVMs due to (Joachims et al. 2009). 
This method can handle general loss functions, and is implemented in the popular SVMstruct 
package¹. The method is based on the cutting plane method from convex optimization (Kelley 
1960). 

The basic idea is as follows. We start with an initial guess $\mathbf{w}$ and no constraints. At each 
iteration, we then do the following: for each example $i$, we find the “most violated” constraint 
involving $\mathbf{x}_i$ and $\hat{\mathbf{y}}_i$. If the loss-augmented margin violation exceeds the current value of $\xi_i$ by 
more than $\epsilon$, we add $\hat{\mathbf{y}}_i$ to the working set of constraints for this training case, $\mathcal{W}_i$, and then 
solve the resulting new QP to find the new $\mathbf{w}, \boldsymbol{\xi}$. See Figure 19.21 for a sketch, and Algorithm 19.3 
for the pseudo code. (Since at each step we only add one new constraint, we can warm-start 
the QP solver.) We can easily modify the algorithm to optimize the slack rescaling version 
by replacing the expression $L(\mathbf{y}_i, \hat{\mathbf{y}}_i) - \mathbf{w}^T \boldsymbol{\delta}_i(\hat{\mathbf{y}}_i)$ with $L(\mathbf{y}_i, \hat{\mathbf{y}}_i)(1 - \mathbf{w}^T \boldsymbol{\delta}_i(\hat{\mathbf{y}}_i))$. 

The key to the efficiency of this method is that only polynomially many constraints need to 
be added, and as soon as they are, the exponential number of other constraints are guaranteed 
to also be satisfied to within a tolerance of $\epsilon$ (see (Tsochantaridis et al. 2005) for the proof). 


Loss-augmented decoding 


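The other key to efficiency is the ability to find the most violated constraint in line 5 of the 
algorithm, i.e., to compute 

$$\arg\max_{\mathbf{y} \in \mathcal{Y}} L(\mathbf{y}_i, \mathbf{y}) - \mathbf{w}^T \boldsymbol{\delta}_i(\mathbf{y}) = \arg\max_{\mathbf{y} \in \mathcal{Y}} L(\mathbf{y}_i, \mathbf{y}) + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}) \tag{19.107}$$


1. http://svmlight.joachims.org/svm_struct.html 


Algorithm 19.3: Cutting plane algorithm for SSVMs (margin rescaling, N-slack version) 
1  Input $\mathcal{D} = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_N, \mathbf{y}_N)\}$, $C$, $\epsilon$; 
2  $\mathcal{W}_i = \emptyset$, $\xi_i = 0$ for $i = 1:N$; 
3  repeat 
4      for $i = 1:N$ do 
5          $\hat{\mathbf{y}}_i = \arg\max_{\hat{\mathbf{y}}_i \in \mathcal{Y}} L(\mathbf{y}_i, \hat{\mathbf{y}}_i) - \mathbf{w}^T \boldsymbol{\delta}_i(\hat{\mathbf{y}}_i)$; 
6          if $L(\mathbf{y}_i, \hat{\mathbf{y}}_i) - \mathbf{w}^T \boldsymbol{\delta}_i(\hat{\mathbf{y}}_i) > \xi_i + \epsilon$ then 
7              $\mathcal{W}_i = \mathcal{W}_i \cup \{\hat{\mathbf{y}}_i\}$; 
8              $(\mathbf{w}, \boldsymbol{\xi}) = \arg\min_{\mathbf{w}, \boldsymbol{\xi} \geq 0} \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{i=1}^N \xi_i$ 
               s.t. $\forall i = 1:N, \forall \mathbf{y}' \in \mathcal{W}_i : \mathbf{w}^T \boldsymbol{\delta}_i(\mathbf{y}') \geq L(\mathbf{y}_i, \mathbf{y}') - \xi_i$; 
9  until no $\mathcal{W}_i$ has changed; 
10 Return $(\mathbf{w}, \boldsymbol{\xi})$ 
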


We call this process loss-augmented decoding. (In (Joachims et al. 2009), this procedure is 
called the separation oracle.) If the loss function has an additive decomposition of the same 
form as the features, then we can fold the loss into the weight vector, i.e., we can find a new 
set of parameters $\mathbf{w}'$ such that $(\mathbf{w}')^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}) = L(\mathbf{y}_i, \mathbf{y}) + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y})$. We can then use a standard decoding 
algorithm, such as Viterbi, on the model $p(\mathbf{y}|\mathbf{x}, \mathbf{w}')$. 

In the special case of 0-1 loss, the optimum will either be the best solution, $\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y})$, 
with a value of $0 - \mathbf{w}^T \boldsymbol{\delta}_i(\hat{\mathbf{y}})$, or it will be the second best solution, i.e., 

$$\tilde{\mathbf{y}} = \arg\max_{\mathbf{y} \neq \hat{\mathbf{y}}} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}) \tag{19.108}$$

which achieves an overall value of $1 - \mathbf{w}^T \boldsymbol{\delta}_i(\tilde{\mathbf{y}})$. For chain structured CRFs, we can use the 
Viterbi algorithm to do decoding; the second best path will differ from the best path in a single 
position, which can be obtained by changing the variable whose max marginal is closest to its 
decision boundary to its second best value. We can generalize this (with a bit more work) to 
find the N-best list (Schwarz and Chow 1990; Nilsson and Goldberger 2001). 

For Hamming loss, $L(\mathbf{y}^*, \mathbf{y}) = \sum_t \mathbb{I}(y_t^* \neq y_t)$, and for the F1 score (defined in Section 5.7.2.3), 
we can devise a dynamic programming algorithm to compute Equation 19.107. See (Altun et al. 
2006) for details. Other models and loss function combinations will require different methods. 
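
For Hamming loss on a chain, the loss contributes one unit per mislabelled position, so it can be added to the node scores before running ordinary Viterbi. The sketch below illustrates this under assumed inputs: `node[t][k]` and `edge[j][k]` stand for the already-computed terms of the linear score on a chain, and all function names are hypothetical rather than part of any library.

```python
from itertools import product

def viterbi(node, edge):
    """Highest-scoring label sequence for a chain.
    node[t][k]: score of label k at position t; edge[j][k]: score of transition j -> k."""
    T, K = len(node), len(node[0])
    best = [list(node[0])]   # best[t][k]: best score of any prefix ending in label k
    back = []                # back[t-1][k]: best predecessor of label k at position t
    for t in range(1, T):
        row, ptr = [], []
        for k in range(K):
            j = max(range(K), key=lambda j: best[-1][j] + edge[j][k])
            row.append(best[-1][j] + edge[j][k] + node[t][k])
            ptr.append(j)
        best.append(row)
        back.append(ptr)
    k = max(range(K), key=lambda k: best[-1][k])
    path = [k]
    for ptr in reversed(back):   # trace the argmax pointers back to position 0
        k = ptr[k]
        path.append(k)
    return path[::-1]

def loss_augmented_decode(node, edge, y_true):
    # Hamming loss adds 1 to every label that disagrees with y_true,
    # so it folds into the node scores and plain Viterbi does the rest.
    aug = [[s + (0.0 if k == y_true[t] else 1.0) for k, s in enumerate(row)]
           for t, row in enumerate(node)]
    return viterbi(aug, edge)

def augmented_value(node, edge, y_true, y):
    # L(y_true, y) + score(y), the quantity maximized in Equation 19.107.
    loss = sum(1 for a, b in zip(y_true, y) if a != b)
    unary = sum(node[t][k] for t, k in enumerate(y))
    pairwise = sum(edge[a][b] for a, b in zip(y, y[1:]))
    return loss + unary + pairwise

node = [[0.2, 0.0], [0.0, 0.1], [0.5, 0.4]]
edge = [[0.3, 0.0], [0.0, 0.3]]
y_true = [0, 1, 0]
y_hat = loss_augmented_decode(node, edge, y_true)
brute = max(augmented_value(node, edge, y_true, y)
            for y in product(range(2), repeat=len(node)))
print(y_hat, augmented_value(node, edge, y_true, y_hat), brute)
```

The last two printed values agree, since the dynamic program and the exhaustive search maximize the same objective; only the former scales to long chains.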


19.7.3.2 A linear time algorithm 


Although the above algorithm takes polynomial time, we can do better, and devise an algorithm 
that runs in linear time, assuming we use a linear kernel (i.e., we work with the original features 
$\boldsymbol{\phi}(\mathbf{x}, \mathbf{y})$ and do not apply the kernel trick). The basic idea, as explained in (Joachims et al. 
2009), is to have a single slack variable, $\xi$, instead of $N$, but to use $|\mathcal{Y}|^N$ constraints, instead of 
just $N|\mathcal{Y}|$. Specifically, we optimize the following (assuming the margin rescaling formulation): 

$$\min_{\mathbf{w}, \xi \geq 0} \frac{1}{2}\|\mathbf{w}\|_2^2 + C\xi \quad \text{s.t.} \quad \forall (\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N) \in \mathcal{Y}^N : \frac{1}{N} \mathbf{w}^T \sum_{i=1}^N \boldsymbol{\delta}_i(\hat{\mathbf{y}}_i) \geq \frac{1}{N} \sum_{i=1}^N L(\mathbf{y}_i, \hat{\mathbf{y}}_i) - \xi \tag{19.109}$$

Compare this to the original version, which was 

$$\min_{\mathbf{w}, \boldsymbol{\xi} \geq 0} \frac{1}{2}\|\mathbf{w}\|_2^2 + \frac{C}{N} \sum_{i=1}^N \xi_i \quad \text{s.t.} \quad \forall i = 1:N, \forall \mathbf{y} \in \mathcal{Y} : \mathbf{w}^T \boldsymbol{\delta}_i(\mathbf{y}) \geq L(\mathbf{y}_i, \mathbf{y}) - \xi_i \tag{19.110}$$

One can show that any solution $\mathbf{w}^*$ of Equation 19.109 is also a solution of Equation 19.110 and 
vice versa, with $\xi^* = \frac{1}{N} \sum_{i=1}^N \xi_i^*$. 
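
One way to see why the two problems agree (a short sketch using only the definitions above) is to eliminate the slack variables for a fixed $\mathbf{w}$. In the 1-slack problem the smallest feasible slack is the largest violation over all joint labelings, and this maximum decomposes because the $i$'th term depends only on $\hat{\mathbf{y}}_i$:

$$\xi^*(\mathbf{w}) = \max_{(\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N) \in \mathcal{Y}^N} \frac{1}{N} \sum_{i=1}^N \left[ L(\mathbf{y}_i, \hat{\mathbf{y}}_i) - \mathbf{w}^T \boldsymbol{\delta}_i(\hat{\mathbf{y}}_i) \right] = \frac{1}{N} \sum_{i=1}^N \max_{\hat{\mathbf{y}}_i \in \mathcal{Y}} \left[ L(\mathbf{y}_i, \hat{\mathbf{y}}_i) - \mathbf{w}^T \boldsymbol{\delta}_i(\hat{\mathbf{y}}_i) \right] = \frac{1}{N} \sum_{i=1}^N \xi_i^*(\mathbf{w})$$

Each inner maximum is non-negative (take $\hat{\mathbf{y}}_i = \mathbf{y}_i$), so the constraint $\xi \geq 0$ is automatically satisfied. Hence $C \xi^*(\mathbf{w}) = \frac{C}{N} \sum_i \xi_i^*(\mathbf{w})$, and the two objectives coincide as functions of $\mathbf{w}$. 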


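Algorithm 19.4: Cutting plane algorithm for SSVMs (margin rescaling, 1-slack version) 
1  Input $\mathcal{D} = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_N, \mathbf{y}_N)\}$, $C$, $\epsilon$; 
2  $\mathcal{W} = \emptyset$; 
3  repeat 
4      $(\mathbf{w}, \xi) = \arg\min_{\mathbf{w}, \xi \geq 0} \frac{1}{2}\|\mathbf{w}\|_2^2 + C\xi$ 
5          s.t. $\forall (\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N) \in \mathcal{W} : \frac{1}{N} \mathbf{w}^T \sum_{i=1}^N \boldsymbol{\delta}_i(\hat{\mathbf{y}}_i) \geq \frac{1}{N} \sum_{i=1}^N L(\mathbf{y}_i, \hat{\mathbf{y}}_i) - \xi$; 
6      for $i = 1:N$ do 
7          $\hat{\mathbf{y}}_i = \arg\max_{\hat{\mathbf{y}}_i \in \mathcal{Y}} L(\mathbf{y}_i, \hat{\mathbf{y}}_i) + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \hat{\mathbf{y}}_i)$ 
8      $\mathcal{W} = \mathcal{W} \cup \{(\hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_N)\}$ 
9  until $\frac{1}{N} \sum_{i=1}^N L(\mathbf{y}_i, \hat{\mathbf{y}}_i) - \frac{1}{N} \mathbf{w}^T \sum_{i=1}^N \boldsymbol{\delta}_i(\hat{\mathbf{y}}_i) \leq \xi + \epsilon$ 
10 Return $(\mathbf{w}, \xi)$ 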


We can optimize Equation 19.109 using the cutting plane algorithm in Algorithm 19.4. (This 
is what is implemented in SVMstruct.) The inner QP in line 4 can be solved in $O(N)$ time 
using the method of (Joachims 2006). In line 7 we make $N$ calls to the loss-augmented decoder. 
Finally, it can be shown that the number of iterations is a constant independent of $N$. Thus 
the overall running time is linear. 


19.7.4 Online algorithms for fitting SSVMs 


Although the cutting plane algorithm can be made to run in time linear in the number of data 
points, that can still be slow if we have a large dataset. In such cases, it is preferable to use 
online learning. We briefly mention a few possible algorithms below. 


The structured perceptron algorithm 


A very simple algorithm for fitting SSVMs is the structured perceptron algorithm (Collins 
2002). This method is an extension of the regular perceptron algorithm of Section 8.5.4. At each 
step, we compute $\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} p(\mathbf{y}|\mathbf{x})$ (e.g., using the Viterbi algorithm) for the current training 
sample $\mathbf{x}$. If $\hat{\mathbf{y}} = \mathbf{y}$, we do nothing, otherwise we update the weight vector using 

$$\mathbf{w}_{k+1} = \mathbf{w}_k + \boldsymbol{\phi}(\mathbf{y}, \mathbf{x}) - \boldsymbol{\phi}(\hat{\mathbf{y}}, \mathbf{x}) \tag{19.111}$$

To get good performance, it is necessary to average the parameters over the last few updates 
(see Section 8.5.2 for details), rather than using the most recent value. 
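
The update is short enough to write out in full. The following is a minimal sketch under stated assumptions: `phi` and `decode` are caller-supplied, hypothetical helpers (a joint feature map and an argmax decoder), the two-class toy problem is invented for illustration, and the averaging is taken over all steps, which is one common variant rather than the only choice.

```python
def perceptron_step(w, phi, x, y, decode):
    """One structured perceptron update: w + phi(x, y) - phi(x, y_hat).
    phi(x, y) returns a feature list; decode(w, x) returns argmax_y w^T phi(x, y)."""
    y_hat = decode(w, x)
    if y_hat == y:
        return w  # correct prediction: leave the weights alone
    f_true, f_hat = phi(x, y), phi(x, y_hat)
    return [wk + a - b for wk, a, b in zip(w, f_true, f_hat)]

def averaged_perceptron(data, phi, decode, dim, epochs=5):
    w = [0.0] * dim
    total = [0.0] * dim
    steps = 0
    for _ in range(epochs):
        for x, y in data:
            w = perceptron_step(w, phi, x, y, decode)
            total = [t + wk for t, wk in zip(total, w)]  # running sum for averaging
            steps += 1
    return [t / steps for t in total]

# Toy problem (an assumption): two labels, phi copies x into the block of label y.
def phi(x, y):
    return [x[0], x[1], 0.0, 0.0] if y == 0 else [0.0, 0.0, x[0], x[1]]

def decode(w, x):
    return max((0, 1), key=lambda y: sum(a * b for a, b in zip(w, phi(x, y))))

data = [((1.0, 0.0), 0), ((0.0, 1.0), 1)]
w_avg = averaged_perceptron(data, phi, decode, dim=4)
print([decode(w_avg, x) for x, _ in data])
```

In a structured problem `decode` would be a Viterbi-style routine; here it simply enumerates the two labels.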


19.7.4.2 Stochastic subgradient descent 


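The disadvantage of the structured perceptron algorithm is that it implicitly assumes 0-1 loss, 
and it does not enforce any kind of margin. An alternative approach is to perform stochastic 
subgradient descent. A specific instance of this is the Pegasos algorithm (Shalev-Shwartz et al. 
2007), which stands for “primal estimated sub-gradient solver for SVM”. Pegasos was designed 
for binary SVMs, but can be extended to SSVMs. 

Let us start by considering the objective function: 

$$f(\mathbf{w}) = \sum_{i=1}^N \left( \max_{\hat{\mathbf{y}}_i} \left[ L(\mathbf{y}_i, \hat{\mathbf{y}}_i) + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \hat{\mathbf{y}}_i) \right] - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i) \right) + \lambda \|\mathbf{w}\|^2 \tag{19.112}$$

Let $\hat{\mathbf{y}}_i$ be the argmax of this max. Then the subgradient of this objective function is 

$$\mathbf{g}(\mathbf{w}) = \sum_{i=1}^N \left( \boldsymbol{\phi}(\mathbf{x}_i, \hat{\mathbf{y}}_i) - \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i) \right) + 2\lambda \mathbf{w} \tag{19.113}$$

In stochastic subgradient descent, we approximate this gradient with a single term, $i$, and then 
perform an update: 

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \eta_k \mathbf{g}_i(\mathbf{w}_k) = \mathbf{w}_k - \eta_k \left[ \boldsymbol{\phi}(\mathbf{x}_i, \hat{\mathbf{y}}_i) - \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i) + (2/N) \lambda \mathbf{w}_k \right] \tag{19.114}$$

where $\eta_k$ is the step size parameter, which should satisfy the Robbins-Monro conditions 
(Section 8.5.2.1). (Notice that the perceptron algorithm is just a special case where $\lambda = 0$ and 
$\eta_k = 1$.) To ensure that $\mathbf{w}$ has unit norm, we can project it onto the $\ell_2$ ball after each update. 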
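
A single update of this kind might look as follows. This is a sketch, not the Pegasos reference code: the function names are hypothetical, `y_hat` is assumed to come from a separate loss-augmented decoder, and the projection shown keeps the norm at most `radius` (vectors already inside the ball are left unchanged).

```python
import math

def sgd_step(w, phi, x, y, y_hat, eta, lam, n):
    # w_{k+1} = w_k - eta * [phi(x, y_hat) - phi(x, y) + (2/N) * lam * w_k]
    f_hat, f_true = phi(x, y_hat), phi(x, y)
    return [wk - eta * (a - b + (2.0 / n) * lam * wk)
            for wk, a, b in zip(w, f_hat, f_true)]

def project_to_ball(w, radius=1.0):
    # Euclidean projection onto the l2 ball of the given radius.
    norm = math.sqrt(sum(wk * wk for wk in w))
    return w if norm <= radius else [wk * radius / norm for wk in w]

# Toy feature map (an assumption): the scalar x is copied into the slot of label y.
def phi(x, y):
    return [x if y == 0 else 0.0, x if y == 1 else 0.0]

w = sgd_step([0.0, 0.0], phi, 2.0, 1, 0, eta=1.0, lam=0.0, n=10)
w = project_to_ball(w)
print(w)
```

With `lam=0.0` and `eta=1.0` the step reduces to the perceptron update, as noted above.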


Latent structural SVMs 


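In many applications of interest, we have latent or hidden variables $\mathbf{h}$. For example, in object 
detection problems, we may be told that the image contains an object, so $y = 1$, but we may 
not know where it is. The location of the object, or its pose, can be considered a hidden variable. 
Or in machine translation, we may know the source text $\mathbf{x}$ (say English) and the target text $\mathbf{y}$ 
(say French), but we typically do not know the alignment between the words. 

We will extend our model as follows, to get a latent CRF: 

$$p(\mathbf{y}, \mathbf{h}|\mathbf{x}, \mathbf{w}) = \frac{\exp(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}, \mathbf{h}))}{Z(\mathbf{x}, \mathbf{w})} \tag{19.115}$$

$$Z(\mathbf{x}, \mathbf{w}) = \sum_{\mathbf{y}, \mathbf{h}} \exp(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}, \mathbf{h})) \tag{19.116}$$

In addition, we introduce the loss function $L(\mathbf{y}^*, \mathbf{y}, \mathbf{h})$; this measures the loss when the “action” 
that we take is to predict $\mathbf{y}$ using latent variables $\mathbf{h}$. We could just use $L(\mathbf{y}^*, \mathbf{y})$ as before, since 
$\mathbf{h}$ is usually a nuisance variable and not of direct interest. However, $\mathbf{h}$ can sometimes play a 
useful role in defining a loss function.¹² 

Given the loss function, we define our objective as 

$$R_{EL}(\mathbf{w}) = -\log p(\mathbf{w}) + \sum_{i} \log \left[ \sum_{\mathbf{y}, \mathbf{h}} \exp \tilde{L}(\mathbf{y}_i, \mathbf{y}, \mathbf{h}) \frac{\exp(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}, \mathbf{h}))}{Z(\mathbf{x}_i, \mathbf{w})} \right] \tag{19.117}$$

Using the same loose upper bound as before, we get 

$$R_{EL}(\mathbf{w}) \leq E(\mathbf{w}) + \sum_{i=1}^N \max_{\mathbf{y}, \mathbf{h}} \left\{ \tilde{L}(\mathbf{y}_i, \mathbf{y}, \mathbf{h}) + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}, \mathbf{h}) \right\} - \sum_{i=1}^N \max_{\mathbf{h}} \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i, \mathbf{h}) \tag{19.118}$$

If we set $E(\mathbf{w}) = \frac{1}{2C}\|\mathbf{w}\|_2^2$, we get the same objective as is optimized in latent SVMs (Yu 
and Joachims 2009). 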

Unfortunately, this objective is no longer convex. However, it is a difference of convex 
functions, and hence can be solved efficiently using the CCCP or concave-convex procedure 
(Yuille and Rangarajan 2003). This is a method for minimizing functions of the form $f(\mathbf{w}) - g(\mathbf{w})$, 
where $f$ and $g$ are convex. The method alternates between finding a linear upper bound 
$u$ on $-g$, and then minimizing the convex function $f(\mathbf{w}) + u(\mathbf{w})$; see Algorithm 19.5 for the 
pseudocode. CCCP is guaranteed to decrease the objective at every iteration, and to converge to 
a local minimum or a saddle point. 


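Algorithm 19.5: Concave-Convex Procedure (CCCP) 
1  Set $t = 0$ and initialize $\mathbf{w}_0$; 
2  repeat 
3      Find hyperplane $\mathbf{v}_t$ such that $-g(\mathbf{w}) \leq -g(\mathbf{w}_t) + (\mathbf{w} - \mathbf{w}_t)^T \mathbf{v}_t$ for all $\mathbf{w}$; 
4      Solve $\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} f(\mathbf{w}) + \mathbf{w}^T \mathbf{v}_t$; 
5      Set $t = t + 1$ 
6  until converged; 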
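
To make the two steps concrete, here is a one-dimensional sketch. The functions $f(w) = w^4$ and $g(w) = 2w^2$ are a toy choice made purely for illustration (they are not the latent SSVM objective); both are convex, and the inner minimization happens to have a closed form.

```python
import math

def cccp(w0, num_iters=50):
    """CCCP for the objective f(w) - g(w) with f(w) = w**4 and g(w) = 2 * w**2."""
    w = w0
    history = [w ** 4 - 2 * w ** 2]
    for _ in range(num_iters):
        # Linearization step: since g is convex, -g(w) <= -g(w_t) + (w - w_t) * v
        # holds for all w when v = -g'(w_t) = -4 * w_t.
        v = -4.0 * w
        # Convex step: argmin_w w**4 + v * w solves 4 * w**3 + v = 0.
        r = -v / 4.0
        w = math.copysign(abs(r) ** (1.0 / 3.0), r)  # real cube root of r
        history.append(w ** 4 - 2 * w ** 2)
    return w, history

w_final, history = cccp(0.1)
print(w_final, history[0], history[-1])
```

Starting from a positive value the iterates approach $w = 1$, one of the two minimizers of $w^4 - 2w^2$, and the recorded objective values never increase. Starting exactly at $w = 0$ the iteration stays there, which illustrates that CCCP only finds stationary points.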


When applied to latent SSVMs, CCCP is very similar to (hard) EM. In the “E step”, we compute 
the linear upper bound by setting $\mathbf{v}_t = -C \sum_{i=1}^N \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i, \mathbf{h}_i^*)$, where 

$$\mathbf{h}_i^* = \arg\max_{\mathbf{h}} \mathbf{w}_t^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i, \mathbf{h}) \tag{19.119}$$

In the “M step”, we estimate $\mathbf{w}$ using techniques for solving fully visible SSVMs. Specifically, we 
minimize 

$$\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^N \max_{\mathbf{y}, \mathbf{h}} \left\{ \tilde{L}(\mathbf{y}_i, \mathbf{y}, \mathbf{h}) + \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}, \mathbf{h}) \right\} - C \sum_{i=1}^N \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_i, \mathbf{y}_i, \mathbf{h}_i^*) \tag{19.120}$$


12. For example, consider the problem of learning to classify a set of documents as relevant or not to a query. That 
is, given $n$ documents $\mathbf{x}_1, \ldots, \mathbf{x}_n$ for a single query $q$, we want to produce a labeling $y_j \in \{-1, +1\}$, representing 
whether document $j$ is relevant to $q$ or not. Suppose our goal is to maximize the precision at $k$, which is a metric widely 
used in ranking (see Section 9.7.4). We will introduce a latent variable for each document $h_j$ representing its degree 
of relevance. This corresponds to a latent total ordering, that has to be consistent with the observed partial ordering 
$\mathbf{y}$. Given this, we can define the following loss function: $L(\mathbf{y}, \hat{\mathbf{y}}, \mathbf{h}) = \min\{1, \frac{n(\mathbf{y})}{k}\} - \frac{1}{k} \sum_{j=1}^k \mathbb{I}(y_{h_j} = 1)$, where 
$n(\mathbf{y})$ is the total number of relevant documents. This loss is essentially just 1 minus the precision@k, except we replace 
1 with $n(\mathbf{y})/k$ so that the loss will have a minimum of zero. See (Yu and Joachims 2009) for details. 


Exercises 


Exercise 19.1 Derivative of the log partition function 


Derive Equation 19.40. 


Exercise 19.2 CI properties of Gaussian graphical models 
(Source: Jordan.) 


In this question, we study the relationship between sparse matrices and sparse graphs for Gaussian 
graphical models. Consider a multivariate Gaussian $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})$ in 3 dimensions. Suppose $\boldsymbol{\mu} = (0, 0, 0)^T$ 
throughout. 

Recall that for jointly Gaussian random variables, we know that $X_i$ and $X_j$ are independent iff they are 
uncorrelated, i.e., $\Sigma_{ij} = 0$. (This is not true in general, or even if $X_i$ and $X_j$ are Gaussian but not jointly 
Gaussian.) Also, $X_i$ is conditionally independent of $X_j$ given all the other variables iff $\Sigma^{-1}_{ij} = 0$. 


a. Suppose 

$$\boldsymbol{\Sigma} = \begin{pmatrix} 0.75 & 0.5 & 0.25 \\ 0.5 & 1.0 & 0.5 \\ 0.25 & 0.5 & 0.75 \end{pmatrix}$$

Are there any marginal independencies amongst $X_1$, $X_2$ and $X_3$? What about conditional 
independencies? Hint: compute $\boldsymbol{\Sigma}^{-1}$ and expand out $\mathbf{x}^T \boldsymbol{\Sigma}^{-1} \mathbf{x}$: which pairwise terms $x_i x_j$ are missing? Draw 
an undirected graphical model that captures as many of these independence statements (marginal and 
conditional) as possible, but does not make any false independence assertions. 


b. Suppose 

$$\boldsymbol{\Sigma} = \begin{pmatrix} 2 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 2 \end{pmatrix}$$

Are there any marginal independencies amongst $X_1$, $X_2$ and $X_3$? Are there any conditional 
independencies amongst $X_1$, $X_2$ and $X_3$? Draw an undirected graphical model that captures as many of 
these independence statements (marginal and conditional) as possible, but does not make any false 
independence assertions. 


c. Now suppose the distribution on $\mathbf{X}$ can be represented by the following DAG: 
$X_1 \rightarrow X_2 \rightarrow X_3$ 
Let the CPDs be as follows: 

$$P(X_1) = \mathcal{N}(X_1; 0, 1), \quad P(X_2|x_1) = \mathcal{N}(X_2; x_1, 1), \quad P(X_3|x_2) = \mathcal{N}(X_3; x_2, 1) \tag{19.121}$$

Multiply these 3 CPDs together and complete the square (Bishop p101) to find the corresponding joint 
distribution $\mathcal{N}(X_{1:3}|\boldsymbol{\mu}, \boldsymbol{\Sigma})$. (You may find it easier to solve for $\boldsymbol{\Sigma}^{-1}$ rather than $\boldsymbol{\Sigma}$.) 

d. For the DAG model in the previous question: Are there any marginal independencies amongst $X_1$, $X_2$ 
and $X_3$? What about conditional independencies? Draw an undirected graphical model that captures 
as many of these independence statements as possible, but does not make any false independence 
assertions (either marginal or conditional). 


Exercise 19.3 Independencies in Gaussian graphical models 


(Source: MacKay.) 


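a. Consider the DAG $X_1 \leftarrow X_2 \rightarrow X_3$. Assume that all the CPDs are linear-Gaussian. Which of the 
following matrices could be the covariance matrix? 

$$A = \begin{pmatrix} 9 & 3 & 1 \\ 3 & 9 & 3 \\ 1 & 3 & 9 \end{pmatrix}, \quad B = \begin{pmatrix} 8 & -3 & 1 \\ -3 & 9 & -3 \\ 1 & -3 & 8 \end{pmatrix}, \quad C = \begin{pmatrix} 9 & 3 & 0 \\ 3 & 9 & 3 \\ 0 & 3 & 9 \end{pmatrix}, \quad D = \begin{pmatrix} 9 & -3 & 0 \\ -3 & 10 & -3 \\ 0 & -3 & 9 \end{pmatrix} \tag{19.122}$$


b. Which of the above matrices could be the inverse covariance matrix? 


c. Consider the DAG $X_1 \rightarrow X_2 \leftarrow X_3$. Assume that all the CPDs are linear-Gaussian. Which of the 
above matrices could be the covariance matrix? 


d. Which of the above matrices could be the inverse covariance matrix? 


e. Let three variables $x_1, x_2, x_3$ have covariance matrix $\boldsymbol{\Sigma}_{(1:3)}$ and precision matrix $\boldsymbol{\Omega}_{(1:3)} = \boldsymbol{\Sigma}_{(1:3)}^{-1}$ as 
follows 

$$\boldsymbol{\Sigma}_{(1:3)} = \begin{pmatrix} 1 & 0.5 & 0 \\ 0.5 & 1 & 0.5 \\ 0 & 0.5 & 1 \end{pmatrix}, \quad \boldsymbol{\Omega}_{(1:3)} = \begin{pmatrix} 1.5 & -1 & 0.5 \\ -1 & 2 & -1 \\ 0.5 & -1 & 1.5 \end{pmatrix} \tag{19.123}$$

Now focus on $x_1$ and $x_2$. Which of the following statements about their covariance matrix $\boldsymbol{\Sigma}_{(1:2)}$ and 
precision matrix $\boldsymbol{\Omega}_{(1:2)}$ are true? 

$$A: \boldsymbol{\Sigma}_{(1:2)} = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}, \quad B: \boldsymbol{\Omega}_{(1:2)} = \begin{pmatrix} 1.5 & -1 \\ -1 & 2 \end{pmatrix} \tag{19.124}$$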

Exercise 19.4 Cost of training MRFs and CRFs 


(Source: Koller.) Consider the process of gradient-ascent training for a log-linear model with k features, 
given a data set with N training instances. Assume for simplicity that the cost of computing a single 
feature over a single instance in our data set is constant, as is the cost of computing the expected value 
of each feature once we compute a marginal over the variables in its scope. Assume that it takes c time 
to compute all the marginals for each data case. Also, assume that we need r iterations for the gradient 
process to converge. 

• Using this notation, what is the time required to train an MRF in big-O notation? 


• Using this notation, what is the time required to train a CRF in big-O notation? 


Exercise 19.5 Full conditional in an Ising model 


Consider an Ising model 

$$p(x_1, \ldots, x_n|\boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \prod_{<ij>} \exp(J_{ij} x_i x_j) \prod_{i=1}^n \exp(h_i x_i) \tag{19.125}$$

where $<ij>$ denotes all unique pairs (i.e., all edges), $J_{ij} \in \mathbb{R}$ is the coupling strength (weight) on edge 
$i - j$, $h_i \in \mathbb{R}$ is the local evidence (bias term), and $\boldsymbol{\theta} = (\mathbf{J}, \mathbf{h})$ are all the parameters. 
If $x_i \in \{0, 1\}$, derive an expression for the full conditional 

$$p(x_i = 1|\mathbf{x}_{-i}, \boldsymbol{\theta}) = p(x_i = 1|\mathbf{x}_{\mathrm{nb}_i}, \boldsymbol{\theta}) \tag{19.126}$$

where $\mathbf{x}_{-i}$ are all nodes except $i$, and $\mathrm{nb}_i$ are the neighbors of $i$ in the graph. Hint: your answer should 
use the sigmoid/logistic function $\sigma(z) = 1/(1 + e^{-z})$. Now suppose $x_i \in \{-1, +1\}$. Derive a related 
expression for $p(x_i|\mathbf{x}_{-i}, \boldsymbol{\theta})$ in this case. (This result can be used when applying Gibbs sampling to the 
model.) 


Exact inference for graphical models 


Introduction 


In Section 17.4.3, we discussed the forwards-backwards algorithm, which can exactly compute the 
posterior marginals $p(x_t|\mathbf{v}, \boldsymbol{\theta})$ in any chain-structured graphical model, where $\mathbf{x}$ are the hidden 
variables (assumed discrete) and v are the visible variables. This algorithm can be modified 
to compute the posterior mode and posterior samples. A similar algorithm for linear-Gaussian 
chains, known as the Kalman smoother, was discussed in Section 18.3.2. Our goal in this chapter 
is to generalize these exact inference algorithms to arbitrary graphs. The resulting methods apply 
to both directed and undirected graphical models. We will describe a variety of algorithms, but 
we omit their derivations for brevity. See e.g., (Darwiche 2009; Koller and Friedman 2009) for a 
detailed exposition of exact inference techniques for discrete directed graphical models. 


Belief propagation for trees 


In this section, we generalize the forwards-backwards algorithm from chains to trees. The 
resulting algorithm is known as belief propagation (BP) (Pearl 1988), or the sum-product 
algorithm. 


20.2.1 Serial protocol 


We initially assume (for notational simplicity) that the model is a pairwise MRF (or CRF), i.e., 

$$p(\mathbf{x}|\mathbf{v}) = \frac{1}{Z(\mathbf{v})} \prod_{s \in \mathcal{V}} \psi_s(x_s) \prod_{(s,t) \in \mathcal{E}} \psi_{s,t}(x_s, x_t) \tag{20.1}$$

where $\psi_s$ is the local evidence for node $s$, and $\psi_{st}$ is the potential for edge $s - t$. We will 
consider the case of models with higher order cliques (such as directed trees) later on. 

One way to implement BP for undirected trees is as follows. Pick an arbitrary node and call it 
the root, r. Now orient all edges away from r (intuitively, we can imagine “picking up the graph” 
at node r and letting all the edges “dangle” down). This gives us a well-defined notion of parent 
and child. Now we send messages up from the leaves to the root (the collect evidence phase) 
and then back down from the root (the distribute evidence phase), in a manner analogous to 
forwards-backwards on chains. 


Figure 20.1 Message passing on a tree. (a) Collect-to-root phase. (b) Distribute-from-root phase. 


To explain the process in more detail, consider the example in Figure 20.1. Suppose we want 
to compute the belief state at node $t$. We will initially condition the belief only on evidence that 
is at or below $t$ in the graph, i.e., we want to compute $\mathrm{bel}_t^-(x_t) \triangleq p(x_t|\mathbf{v}_t^-)$. We will call this a 
“bottom-up belief state”. Suppose, by induction, that we have computed “messages” from $t$'s two 
children, summarizing what they think $t$ should know about the evidence in their subtrees, i.e., 
we have computed $m_{s \to t}^-(x_t) = p(x_t|\mathbf{v}_{st}^-)$, where $\mathbf{v}_{st}^-$ is all the evidence on the downstream 
side of the $s - t$ edge (see Figure 20.1(a)), and similarly we have computed $m_{u \to t}^-(x_t)$. Then we 
can compute the bottom-up belief state at $t$ as follows: 

$$\mathrm{bel}_t^-(x_t) \triangleq p(x_t|\mathbf{v}_t^-) = \frac{1}{Z_t} \psi_t(x_t) \prod_{c \in \mathrm{ch}(t)} m_{c \to t}^-(x_t) \tag{20.2}$$

where $\psi_t(x_t) \propto p(x_t|\mathbf{v}_t)$ is the local evidence for node $t$, and $Z_t$ is the local normalization 
constant. In words, we multiply all the incoming messages from our children, as well as the 
incoming message from our local evidence, and then normalize. 

We have explained how to compute the bottom-up belief states from the bottom-up messages. 
How do we compute the messages themselves? Consider computing $m_{s \to t}^-(x_t)$, where $s$ is one 
of $t$'s children. Assume, by recursion, that we have computed $\mathrm{bel}_s^-(x_s) = p(x_s|\mathbf{v}_s^-)$. Then we 
can compute the message as follows: 

$$m_{s \to t}^-(x_t) = \sum_{x_s} \psi_{st}(x_s, x_t) \mathrm{bel}_s^-(x_s) \tag{20.3}$$

Essentially we convert beliefs about $x_s$ into beliefs about $x_t$ by using the edge potential $\psi_{st}$. 
We continue in this way up the tree until we reach the root. Once at the root, we have “seen” 
all the evidence in the tree, so we can compute our local belief state at the root using 

$$\mathrm{bel}_r(x_r) \triangleq p(x_r|\mathbf{v}) = p(x_r|\mathbf{v}_r^-) \propto \psi_r(x_r) \prod_{c \in \mathrm{ch}(r)} m_{c \to r}^-(x_r) \tag{20.4}$$


This completes the upwards pass, which is analogous to the forwards pass in an 
HMM. As a “side effect”, we can compute the probability of the evidence by collecting the 
normalization constants: 

$$p(\mathbf{v}) = \prod_t Z_t \tag{20.5}$$

We can now pass messages down from the root. For example, consider node $s$, with parent $t$, 
as shown in Figure 20.1(b). To compute the belief state for $s$, we need to combine the bottom-up 
belief for $s$ together with a top-down message from $t$, which summarizes all the information in 
the rest of the graph, $m_{t \to s}^+(x_s) \triangleq p(x_s|\mathbf{v}_{st}^+)$, where $\mathbf{v}_{st}^+$ is all the evidence on the upstream 
(root) side of the $s - t$ edge, as shown in Figure 20.1(b). We then have 

$$\mathrm{bel}_s(x_s) \triangleq p(x_s|\mathbf{v}) \propto \mathrm{bel}_s^-(x_s) \prod_{t \in \mathrm{pa}(s)} m_{t \to s}^+(x_s) \tag{20.6}$$

How do we compute these downward messages? For example, consider the message from $t$ 
to $s$. Suppose $t$'s parent is $r$, and $t$'s children are $s$ and $u$, as shown in Figure 20.1(b). We want 
to include in $m_{t \to s}^+$ all the information that $t$ has received, except for the information that $s$ 
sent it: 

$$m_{t \to s}^+(x_s) \triangleq p(x_s|\mathbf{v}_{st}^+) = \sum_{x_t} \psi_{st}(x_s, x_t) \frac{\mathrm{bel}_t(x_t)}{m_{s \to t}^-(x_t)} \tag{20.7}$$

Rather than dividing out the message sent up to $t$, we can plug in the equation of $\mathrm{bel}_t$ to get 

$$m_{t \to s}^+(x_s) = \sum_{x_t} \psi_{st}(x_s, x_t) \psi_t(x_t) \prod_{c \in \mathrm{ch}(t), c \neq s} m_{c \to t}^-(x_t) \prod_{p \in \mathrm{pa}(t)} m_{p \to t}^+(x_t) \tag{20.8}$$

In other words, we multiply together all the messages coming into $t$ from all nodes except for 
the recipient $s$, combine together, and then pass through the edge potential $\psi_{st}$. In the case of 
a chain, $t$ only has one child $s$ and one parent $p$, so the above simplifies to 

$$m_{t \to s}^+(x_s) = \sum_{x_t} \psi_{st}(x_s, x_t) \psi_t(x_t) m_{p \to t}^+(x_t) \tag{20.9}$$

The version of BP in which we use division is called belief updating, and the version in 
which we multiply all-but-one of the messages is called sum-product. The belief updating 
version is analogous to how we formulated the Kalman smoother in Section 18.3.2: the top- 
down messages depend on the bottom-up messages. This means they can be interpreted as 
conditional posterior probabilities. The sum-product version is analogous to how we formulated 
the backwards algorithm in Section 17.4.3: the top-down messages are completely independent 
of the bottom-up messages, which means they can only be interpreted as conditional likelihoods. 
See Section 18.3.2.3 for a more detailed discussion of this subtle difference. 


Parallel protocol 


So far, we have presented a serial version of the algorithm, in which we send messages up 
to the root and back. This is the optimal approach for a tree, and is a natural extension of 
forwards-backwards on chains. However, as a prelude to handling general graphs with loops, we 
now consider a parallel version of BP. This gives equivalent results to the serial version but is 
less efficient when implemented on a serial machine. 

The basic idea is that all nodes receive messages from their neighbors in parallel, they then 
update their belief states, and finally they send new messages back out to their neighbors. 
This process repeats until convergence. This kind of computing architecture is called a systolic 
array, due to its resemblance to a beating heart. 

More precisely, we initialize all messages to the all 1's vector. Then, in parallel, each node 
absorbs messages from all its neighbors using 

$$\mathrm{bel}_s(x_s) \propto \psi_s(x_s) \prod_{t \in \mathrm{nbr}_s} m_{t \to s}(x_s) \tag{20.10}$$

Then, in parallel, each node sends messages to each of its neighbors: 

$$m_{s \to t}(x_t) = \sum_{x_s} \left( \psi_s(x_s) \psi_{st}(x_s, x_t) \prod_{u \in \mathrm{nbr}_s \setminus t} m_{u \to s}(x_s) \right) \tag{20.11}$$

The $m_{s \to t}$ message is computed by multiplying together all incoming messages, except the one 
sent by the recipient, and then passing through the $\psi_{st}$ potential. 

At iteration $T$ of the algorithm, $\mathrm{bel}_s(x_s)$ represents the posterior belief of $x_s$ conditioned on 
the evidence that is $T$ steps away in the graph. After $D(G)$ steps, where $D(G)$ is the diameter 
of the graph (the largest distance between any pair of nodes), every node has obtained 
information from all the other nodes. Its local belief state is then the correct posterior marginal. 
Since the diameter of a tree is at most $|\mathcal{V}| - 1$, the algorithm converges in a linear number of 
steps. 
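
The parallel schedule is easy to prototype for small discrete models. The following sketch implements the two updates (absorb, then send) and compares the resulting beliefs with marginals obtained by brute-force enumeration. The data layout (`node_pot` as a list of lists, `edge_pot` as a dictionary keyed by node pairs) and the four-node tree are assumptions made for illustration; messages are normalized at each step, which does not change the final beliefs.

```python
from itertools import product

def parallel_bp(node_pot, edge_pot, num_iters):
    """Parallel sum-product on a pairwise MRF.
    node_pot[s][k] = psi_s(x_s = k); edge_pot[(s, t)][j][k] = psi_st(x_s = j, x_t = k)."""
    n, K = len(node_pot), len(node_pot[0])
    nbrs = {s: [] for s in range(n)}
    for s, t in edge_pot:
        nbrs[s].append(t)
        nbrs[t].append(s)

    def psi(s, t, xs, xt):
        # Look up the edge potential in whichever orientation it was stored.
        return edge_pot[(s, t)][xs][xt] if (s, t) in edge_pot else edge_pot[(t, s)][xt][xs]

    # Initialize every message to the all 1's vector.
    msg = {(s, t): [1.0] * K for s in range(n) for t in nbrs[s]}
    for _ in range(num_iters):
        new = {}
        for (s, t) in msg:
            out = []
            for xt in range(K):
                total = 0.0
                for xs in range(K):
                    term = node_pot[s][xs] * psi(s, t, xs, xt)
                    for u in nbrs[s]:
                        if u != t:  # exclude the message sent by the recipient
                            term *= msg[(u, s)][xs]
                    total += term
                out.append(total)
            z = sum(out)
            new[(s, t)] = [v / z for v in out]
        msg = new  # all messages are replaced at once (parallel schedule)

    beliefs = []
    for s in range(n):
        b = list(node_pot[s])
        for t in nbrs[s]:
            b = [bk * mk for bk, mk in zip(b, msg[(t, s)])]
        z = sum(b)
        beliefs.append([bk / z for bk in b])
    return beliefs

def brute_force_marginals(node_pot, edge_pot):
    n, K = len(node_pot), len(node_pot[0])
    marg = [[0.0] * K for _ in range(n)]
    for x in product(range(K), repeat=n):
        p = 1.0
        for s in range(n):
            p *= node_pot[s][x[s]]
        for (s, t), table in edge_pot.items():
            p *= table[x[s]][x[t]]
        for s in range(n):
            marg[s][x[s]] += p
    z = sum(marg[0])
    return [[v / z for v in row] for row in marg]

# A 4-node tree: node 1 is connected to nodes 0, 2 and 3 (diameter 2).
node_pot = [[1.0, 2.0], [1.0, 1.0], [3.0, 1.0], [1.0, 4.0]]
edge_pot = {(0, 1): [[2.0, 1.0], [1.0, 2.0]],
            (1, 2): [[1.0, 3.0], [2.0, 1.0]],
            (1, 3): [[2.0, 1.0], [1.0, 2.0]]}
bp = parallel_bp(node_pot, edge_pot, num_iters=3)
exact = brute_force_marginals(node_pot, edge_pot)
print(bp[1], exact[1])
```

On this tree the beliefs already match the exact marginals after a number of iterations equal to the diameter; running for $|\mathcal{V}| - 1$ iterations, as here, is always sufficient on a tree.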

We can actually derive the up-down version of the algorithm by imposing the condition that 
a node can only send a message once it has received messages from all its other neighbors. 
This means we must start with the leaf nodes, which only have one neighbor. The messages 
then propagate up to the root and back. We can also update the nodes in a random order. 
The only requirement is that each node get updated D(G) times. This is just enough time for 
information to spread throughout the whole tree. 

Similar parallel, distributed algorithms for solving linear systems of equations are discussed 
in (Bertsekas 1997). In particular, the Gauss-Seidel algorithm is analogous to the serial up-down 
version of BP, and the Jacobi algorithm is analogous to the parallel version of BP. 


Gaussian BP * 


Now consider the case where $p(\mathbf{x}|\mathbf{v})$ is jointly Gaussian, so it can be represented as a Gaussian 
pairwise MRF, as in Section 19.4.4. We now present the belief propagation algorithm for this 
class of models, following the presentation of (Bickson 2009) (see also (Malioutov et al. 2006)). We 
will assume the following node and edge potentials: 

\begin{align}
\psi_t(x_t) &= \exp(-\frac{1}{2} A_{tt} x_t^2 + b_t x_t) \tag{20.12} \\
\psi_{st}(x_s, x_t) &= \exp(-\frac{1}{2} x_s A_{st} x_t) \tag{20.13}
\end{align}

so the overall model has the form 

$$p(\mathbf{x}|\mathbf{v}) \propto \exp(-\frac{1}{2} \mathbf{x}^T \mathbf{A} \mathbf{x} + \mathbf{b}^T \mathbf{x}) \tag{20.14}$$

This is the information form of the MVN (see Exercise 9.2), where $\mathbf{A}$ is the precision matrix. 
Note that by completing the square, the local evidence can be rewritten as a Gaussian: 

$$\psi_t(x_t) \propto \mathcal{N}(b_t/A_{tt}, A_{tt}^{-1}) \triangleq \mathcal{N}(m_t, \ell_t^{-1}) \tag{20.15}$$

Below we describe how to use BP to compute the posterior node marginals, 

$$p(x_t|\mathbf{v}) = \mathcal{N}(\mu_t, \lambda_t^{-1}) \tag{20.16}$$

If the graph is a tree, the method is exact. If the graph is loopy, the posterior means may still 
be exact, but the posterior variances are often too small (Weiss and Freeman 1999). 

Although the precision matrix $\mathbf{A}$ is often sparse, computing the posterior mean requires 
inverting it, since $\boldsymbol{\mu} = \mathbf{A}^{-1} \mathbf{b}$. BP provides a way to exploit graph structure to perform this 
computation in $O(D)$ time instead of $O(D^3)$. This is related to various methods from linear 
algebra, as discussed in (Bickson 2009). 

Since the model is jointly Gaussian, all marginals and all messages will be Gaussian. The 
key operations we need are to multiply together two Gaussian factors, and to marginalize out a 
variable from a joint Gaussian factor. 

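For multiplication, we can use the fact that the product of two Gaussians is Gaussian: 

\begin{align}
\mathcal{N}(x|\mu_1, \lambda_1^{-1}) \times \mathcal{N}(x|\mu_2, \lambda_2^{-1}) &= C \mathcal{N}(x|\mu, \lambda^{-1}) \tag{20.17} \\
\lambda &= \lambda_1 + \lambda_2 \tag{20.18} \\
\mu &= \lambda^{-1}(\mu_1 \lambda_1 + \mu_2 \lambda_2) \tag{20.19}
\end{align}

where 

$$C = \sqrt{\frac{\lambda_1 \lambda_2}{2\pi\lambda}} \exp\left( \frac{1}{2} \left( \lambda_1 \mu_1^2 (\lambda^{-1} \lambda_1 - 1) + \lambda_2 \mu_2^2 (\lambda^{-1} \lambda_2 - 1) + 2 \lambda^{-1} \lambda_1 \lambda_2 \mu_1 \mu_2 \right) \right) \tag{20.20}$$

See Exercise 20.2 for the proof. 
For marginalization, we have the following result: 

$$\int \exp(-a x^2 + b x) dx = \sqrt{\pi/a} \exp(b^2/4a) \tag{20.21}$$

which follows from the normalization constant of a Gaussian (Exercise 2.11). 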
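
The multiplication rule is easy to check numerically. In the sketch below (function names are illustrative assumptions), the precisions add and the mean is the precision-weighted average; the ratio between the pointwise product of the two densities and the combined density should then be the same constant at every $x$.

```python
import math

def normal_pdf(x, mu, lam):
    # Density of N(x | mu, 1/lam), parameterized by the precision lam.
    return math.sqrt(lam / (2.0 * math.pi)) * math.exp(-0.5 * lam * (x - mu) ** 2)

def multiply_gaussians(mu1, lam1, mu2, lam2):
    lam = lam1 + lam2                      # precisions add
    mu = (mu1 * lam1 + mu2 * lam2) / lam   # precision-weighted mean
    return mu, lam

mu, lam = multiply_gaussians(0.0, 1.0, 2.0, 3.0)
ratios = [normal_pdf(x, 0.0, 1.0) * normal_pdf(x, 2.0, 3.0) / normal_pdf(x, mu, lam)
          for x in (-1.0, 0.5, 2.0)]
print(mu, lam, ratios)
```

The three printed ratios agree up to floating point error, confirming that the product is proportional to a single Gaussian with the stated parameters.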

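We now have all the pieces we need. In particular, let the message $m_{s \to t}(x_t)$ be a Gaussian 
with mean $\mu_{st}$ and precision $\lambda_{st}$. From Equation 20.10, the belief at node $s$ is given by the 
product of incoming messages times the local evidence (Equation 20.15) and hence 

\begin{align}
\mathrm{bel}_s(x_s) &= \psi_s(x_s) \prod_{t \in \mathrm{nbr}(s)} m_{t \to s}(x_s) = \mathcal{N}(x_s|\mu_s, \lambda_s^{-1}) \tag{20.22} \\
\lambda_s &= \ell_s + \sum_{t \in \mathrm{nbr}(s)} \lambda_{ts} \tag{20.23} \\
\mu_s &= \lambda_s^{-1} \left( \ell_s m_s + \sum_{t \in \mathrm{nbr}(s)} \lambda_{ts} \mu_{ts} \right) \tag{20.24}
\end{align}

To compute the messages themselves, we use Equation 20.11, which is given by 

\begin{align}
m_{s \to t}(x_t) &= \int \left( \psi_{st}(x_s, x_t) \psi_s(x_s) \prod_{u \in \mathrm{nbr}_s \setminus t} m_{u \to s}(x_s) \right) dx_s \tag{20.25} \\
&= \int \psi_{st}(x_s, x_t) f_{s \backslash t}(x_s) dx_s \tag{20.26}
\end{align}

where $f_{s \backslash t}(x_s)$ is the product of the local evidence and all incoming messages excluding the 
message from $t$: 

\begin{align}
f_{s \backslash t}(x_s) &\triangleq \psi_s(x_s) \prod_{u \in \mathrm{nbr}_s \setminus t} m_{u \to s}(x_s) \tag{20.27} \\
&= \mathcal{N}(x_s|\mu_{s \backslash t}, \lambda_{s \backslash t}^{-1}) \tag{20.28} \\
\lambda_{s \backslash t} &\triangleq \ell_s + \sum_{u \in \mathrm{nbr}(s) \setminus t} \lambda_{us} \tag{20.29} \\
\mu_{s \backslash t} &\triangleq \lambda_{s \backslash t}^{-1} \left( \ell_s m_s + \sum_{u \in \mathrm{nbr}(s) \setminus t} \lambda_{us} \mu_{us} \right) \tag{20.30}
\end{align}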

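Returning to Equation 20.26, and noting that the $(s,t)$ and $(t,s)$ entries of the symmetric matrix $\mathbf{A}$ 
together contribute an edge factor $\exp(-x_s A_{st} x_t)$, we have 

\begin{align}
m_{s \to t}(x_t) &= \int \underbrace{\exp(-x_s A_{st} x_t)}_{\psi_{st}(x_s, x_t)} \underbrace{\exp(-\lambda_{s \backslash t}/2 \, (x_s - \mu_{s \backslash t})^2)}_{f_{s \backslash t}(x_s)} dx_s \tag{20.31} \\
&= \int \exp\left( (-\lambda_{s \backslash t} x_s^2 / 2) + (\lambda_{s \backslash t} \mu_{s \backslash t} - A_{st} x_t) x_s \right) dx_s \times \text{const} \tag{20.32} \\
&\propto \exp\left( (\lambda_{s \backslash t} \mu_{s \backslash t} - A_{st} x_t)^2 / (2 \lambda_{s \backslash t}) \right) \tag{20.33} \\
&\propto \mathcal{N}(\mu_{st}, \lambda_{st}^{-1}) \tag{20.34} \\
\lambda_{st} &= -A_{st}^2 / \lambda_{s \backslash t} \tag{20.35} \\
\mu_{st} &= \lambda_{s \backslash t} \mu_{s \backslash t} / A_{st} \tag{20.36}
\end{align}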

One can generalize these equations to the case where each node is a vector, and the messages 
become small MVNs instead of scalar Gaussians (Alag and Agogino 1996). If we apply the 
resulting algorithm to a linear dynamical system, we recover the Kalman smoothing algorithm 
of Section 18.3.2. 
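
The scalar message updates can be prototyped in a few lines. The sketch below is an illustration under stated assumptions, not the implementation of (Bickson 2009): it uses the parallel schedule, stores each message in information form (precision, and precision times mean) so that the initial flat messages are simply zeros, takes the edge factor to be $\exp(-x_s A_{st} x_t)$ as in the integral above, and uses an invented 3-node chain.

```python
def gaussian_bp(A, b, num_iters):
    """Scalar Gaussian BP for p(x) proportional to exp(-0.5 x^T A x + b^T x).
    A is a symmetric precision matrix (list of lists) whose sparsity pattern is a tree."""
    n = len(b)
    nbrs = [[t for t in range(n) if t != s and A[s][t] != 0.0] for s in range(n)]
    # msg[(s, t)] = (precision, precision * mean) of the message from s to t.
    msg = {(s, t): (0.0, 0.0) for s in range(n) for t in nbrs[s]}
    for _ in range(num_iters):
        new = {}
        for (s, t) in msg:
            # Local evidence times all incoming messages except the one from t.
            lam = A[s][s] + sum(msg[(u, s)][0] for u in nbrs[s] if u != t)
            h = b[s] + sum(msg[(u, s)][1] for u in nbrs[s] if u != t)
            # Integrate out x_s through the edge factor exp(-x_s A_st x_t).
            new[(s, t)] = (-A[s][t] ** 2 / lam, -A[s][t] * h / lam)
        msg = new
    means, variances = [], []
    for s in range(n):
        lam = A[s][s] + sum(msg[(t, s)][0] for t in nbrs[s])
        h = b[s] + sum(msg[(t, s)][1] for t in nbrs[s])
        means.append(h / lam)
        variances.append(1.0 / lam)
    return means, variances

# A 3-node chain; the exact answer is mean (1, 1, 1) and variances (0.75, 1, 0.75).
A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
b = [1.0, 0.0, 1.0]
means, variances = gaussian_bp(A, b, num_iters=3)
print(means, variances)
```

A convenient check that avoids inverting $\mathbf{A}$ is to verify that the returned means satisfy $\mathbf{A} \boldsymbol{\mu} = \mathbf{b}$. Note that the message precisions are negative here; only the beliefs need to be proper Gaussians.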

To perform message passing in models with non-Gaussian potentials, one can use sampling 
methods to approximate the relevant integrals. This is called non-parametric BP (Sudderth 
et al. 2003; Isard 2003; Sudderth et al. 2010). 


20.2.4 Other BP variants * 


In this section, we briefly discuss several variants of the main algorithm. 


Figure 20.2 Illustration of how to compute the two-slice distribution for an HMM. The $\psi_t$ and $\psi_{t+1}$ 
terms are the local evidence messages from the visible nodes $v_t$, $v_{t+1}$ to the hidden nodes $x_t$, $x_{t+1}$ 
respectively; $f_t$ is the forwards message from $x_{t-1}$ and $\beta_{t+1}$ is the backwards message from $x_{t+2}$. 

Max-product algorithm 


It is possible to devise a max-product version of the BP algorithm, by replacing the $\sum$ operator 
with the max operator. We can then compute the local MAP marginal of each node. However, 
if there are ties, this might not be globally consistent, as discussed in Section 17.4.4. Fortunately, 
we can generalize the Viterbi algorithm to trees, where we use max and argmax in the 
collect-to-root phase, and perform traceback in the distribute-from-root phase. See (Dawid 1992) for 
details. 

Sampling from a tree 

It is possible to draw samples from a tree structured model by generalizing the forwards filtering 
/ backwards sampling algorithm discussed in Section 17.4.5. See (Dawid 1992) for details. 


Computing posteriors on sets of variables 


In Section 17.4.3.2, we explained how to compute the “two-slice” distribution $\xi_{t,t+1}(i, j) = 
p(x_t = i, x_{t+1} = j|\mathbf{v})$ in an HMM, namely by using 

$$\xi_{t,t+1}(i, j) = \alpha_t(i) \psi_{t+1}(j) \beta_{t+1}(j) \psi_{t,t+1}(i, j) \tag{20.37}$$

Since $\alpha_t(i) \propto \psi_t(i) f_t(i)$, where $f_t = p(x_t|\mathbf{v}_{1:t-1})$ is the forwards message, we can think of 
this as sending messages $f_t$ and $\psi_t$ into $x_t$, $\beta_{t+1}$ and $\psi_{t+1}$ into $x_{t+1}$, and then combining 
them with the $\boldsymbol{\Psi}$ matrix, as shown in Figure 20.2. This is like treating $x_t$ and $x_{t+1}$ as a single 
“mega node”, and then multiplying all the incoming messages as well as all the local factors 
(here, $\psi_{t,t+1}$). 


Figure 20.3 Left: The “student” DGM. Right: the equivalent UGM. We add moralization arcs D-I, G-J and 
L-S. Based on Figure 9.8 of (Koller and Friedman 2009). 


The variable elimination algorithm 


We have seen how to use BP to compute exact marginals on chains and trees. In this section,
we discuss an algorithm to compute $p(\mathbf{x}_q|\mathbf{x}_v)$ for any kind of graph.

We will explain the algorithm by example. Consider the DGM in Figure 20.3(a). This model, 
from (Koller and Friedman 2009), is a hypothetical model relating various variables pertaining to 
a typical student. The corresponding joint has the following form: 


$$P(C,D,I,G,S,L,J,H) \tag{20.38}$$
$$= P(C)\,P(D|C)\,P(I)\,P(G|I,D)\,P(S|I)\,P(L|G)\,P(J|L,S)\,P(H|G,J) \tag{20.39}$$


Note that the forms of the CPDs do not matter, since all our calculations will be symbolic. 
However, for illustration purposes, we will assume all variables are binary. 

Before proceeding, we convert our model to undirected form. This is not required, but it 
makes for a more unified presentation, since the resulting method can then be applied to both 
DGMs and UGMs (and, as we will see in Section 20.3.1, to a variety of other problems that 
have nothing to do with graphical models). Since the computational complexity of inference in
DGMs and UGMs is, generally speaking, the same, nothing is lost in this transformation from a
computational point of view.¹

To convert the DGM to a UGM, we simply define a potential or factor for every CPD, yielding 


$$p(C,D,I,G,S,L,J,H) \tag{20.40}$$
$$= \psi_C(C)\,\psi_D(D,C)\,\psi_I(I)\,\psi_G(G,I,D)\,\psi_S(S,I)\,\psi_L(L,G)\,\psi_J(J,L,S)\,\psi_H(H,G,J) \tag{20.41}$$


1. There are a few “tricks” one can exploit in the directed case that cannot easily be exploited in the undirected case.
One important example is barren node removal. To explain this, consider a naive Bayes classifier, as in Figure 10.2.
Suppose we want to infer $y$ and we observe $x_1$ and $x_2$, but not $x_3$ and $x_4$. It is clear that we can safely remove
$x_3$ and $x_4$, since $\sum_{x_3} p(x_3|y) = 1$, and similarly for $x_4$. In general, once we have removed hidden leaves, we can
apply this process recursively. Since potential functions do not necessarily sum to one, we cannot use this trick in the
undirected case. See (Koller and Friedman 2009) for a variety of other speedup tricks.

Since the potentials are CPDs, they are all locally normalized, so there is no need for a
global normalization constant, i.e., $Z = 1$. The corresponding undirected graph is shown in
Figure 20.3(b). Note that it has more edges than the DAG. In particular, any “unmarried” nodes 
that share a child must get “married”, by adding an edge between them; this process is known 
as moralization. Only then can the arrows be dropped. In this example, we added D-I, G-J, and
L-S moralization arcs. The reason this operation is required is to ensure that the CI properties
of the UGM match those of the DGM, as explained in Section 19.2.2. It also ensures there is a 
clique that can “store” the CPDs of each family. 

Now suppose we want to compute p(J = 1), the marginal probability that a person will get a 
job. Since we have 8 binary variables, we could simply enumerate over all possible assignments 
to all the variables (except for J), adding up the probability of each joint instantiation: 


$$p(J) = \sum_L \sum_S \sum_G \sum_H \sum_I \sum_D \sum_C p(C,D,I,G,S,L,J,H) \tag{20.42}$$


However, this would take $O(2^7)$ time. We can be smarter by pushing sums inside products.
This is the key idea behind the variable elimination algorithm (Zhang and Poole 1996), also 
called bucket elimination (Dechter 1996), or, in the context of genetic pedigree trees, the 
peeling algorithm (Cannings et al. 1978). In our example, we get 


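$$
\begin{aligned}
p(J) &= \sum_{L,S,G,H,I,D,C} p(C,D,I,G,S,L,J,H) \\
&= \sum_{L,S,G,H,I,D,C} \psi_C(C)\,\psi_D(D,C)\,\psi_I(I)\,\psi_G(G,I,D)\,\psi_S(S,I)\,\psi_L(L,G)\,\psi_J(J,L,S)\,\psi_H(H,G,J) \\
&= \sum_{L,S} \psi_J(J,L,S) \sum_G \psi_L(L,G) \sum_H \psi_H(H,G,J) \sum_I \psi_S(S,I)\,\psi_I(I) \sum_D \psi_G(G,I,D) \sum_C \psi_C(C)\,\psi_D(D,C)
\end{aligned}
$$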


We now evaluate this expression, working right to left as shown in Table 20.1. First we multiply
together all the terms in the scope of the $\sum_C$ operator to create the temporary factor

$$\tau'_1(C,D) = \psi_C(C)\,\psi_D(D,C) \tag{20.43}$$

Then we marginalize out $C$ to get the new factor

$$\tau_1(D) = \sum_C \tau'_1(C,D) \tag{20.44}$$

Next we multiply together all the terms in the scope of the $\sum_D$ operator and then marginalize
out $D$ to create

$$\tau'_2(G,I,D) = \psi_G(G,I,D)\,\tau_1(D) \tag{20.45}$$
$$\tau_2(G,I) = \sum_D \tau'_2(G,I,D) \tag{20.46}$$


$$
\begin{aligned}
&\sum_L \sum_S \psi_J(J,L,S) \sum_G \psi_L(L,G) \sum_H \psi_H(H,G,J) \sum_I \psi_S(S,I)\psi_I(I) \sum_D \psi_G(G,I,D) \underbrace{\sum_C \psi_C(C)\psi_D(D,C)}_{\tau_1(D)} \\
&\sum_L \sum_S \psi_J(J,L,S) \sum_G \psi_L(L,G) \sum_H \psi_H(H,G,J) \sum_I \psi_S(S,I)\psi_I(I) \underbrace{\sum_D \psi_G(G,I,D)\tau_1(D)}_{\tau_2(G,I)} \\
&\sum_L \sum_S \psi_J(J,L,S) \sum_G \psi_L(L,G) \sum_H \psi_H(H,G,J) \underbrace{\sum_I \psi_S(S,I)\psi_I(I)\tau_2(G,I)}_{\tau_3(G,S)} \\
&\sum_L \sum_S \psi_J(J,L,S) \sum_G \psi_L(L,G) \underbrace{\sum_H \psi_H(H,G,J)}_{\tau_4(G,J)} \tau_3(G,S) \\
&\sum_L \sum_S \psi_J(J,L,S) \underbrace{\sum_G \psi_L(L,G)\tau_4(G,J)\tau_3(G,S)}_{\tau_5(J,L,S)} \\
&\sum_L \underbrace{\sum_S \psi_J(J,L,S)\tau_5(J,L,S)}_{\tau_6(J,L)} \\
&\underbrace{\sum_L \tau_6(J,L)}_{\tau_7(J)}
\end{aligned}
$$

Table 20.1 Eliminating variables from Figure 20.3 in the order C, D, I, H, G, S, L to compute P(J).


Next we multiply together all the terms in the scope of the $\sum_I$ operator and then marginalize
out $I$ to create

$$\tau'_3(G,I,S) = \psi_S(S,I)\,\psi_I(I)\,\tau_2(G,I) \tag{20.47}$$
$$\tau_3(G,S) = \sum_I \tau'_3(G,I,S) \tag{20.48}$$

And so on.


The above technique can be used to compute any marginal of interest, such as p(J) or 
p(J,H). To compute a conditional, we can take a ratio of two marginals, where the visible 
variables have been clamped to their known values (and hence don’t need to be summed over). 
For example, 


$$p(J=j|I=1,H=0) = \frac{p(J=j,I=1,H=0)}{\sum_{j'} p(J=j',I=1,H=0)} \tag{20.49}$$

In general, we can write

$$p(\mathbf{x}_q|\mathbf{x}_v) = \frac{p(\mathbf{x}_q,\mathbf{x}_v)}{p(\mathbf{x}_v)} = \frac{\sum_{\mathbf{x}_h} p(\mathbf{x}_h,\mathbf{x}_q,\mathbf{x}_v)}{\sum_{\mathbf{x}_h}\sum_{\mathbf{x}'_q} p(\mathbf{x}_h,\mathbf{x}'_q,\mathbf{x}_v)} \tag{20.50}$$

The normalization constant in the denominator, $p(\mathbf{x}_v)$, is called the probability of the
evidence.

See variableElimination for a simple Matlab implementation of this algorithm, which 
works for arbitrary graphs, and arbitrary discrete factors. But before you go too crazy, please 
read Section 20.3.2, which points out that VE can be exponentially slow in the worst case. 
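
To make the procedure concrete, here is a minimal Python sketch of variable elimination on the student network. It is not the variableElimination code referred to above: the factor representation (a scope tuple plus a NumPy table), the helper names `multiply`, `sum_out`, `clamp` and `eliminate`, and the randomly generated CPDs are all assumptions made for illustration.

```python
import numpy as np

# A factor is a pair (scope, table): a tuple of variable names plus an
# array with one axis of length 2 per variable (all variables are binary).

def multiply(f, g):
    """Pointwise product of two factors, aligned on their shared variables."""
    (fv, ft), (gv, gt) = f, g
    out = tuple(dict.fromkeys(fv + gv))              # ordered union of the scopes
    idx = {v: chr(ord("a") + k) for k, v in enumerate(out)}
    spec = "{},{}->{}".format(*("".join(idx[v] for v in s) for s in (fv, gv, out)))
    return out, np.einsum(spec, ft, gt)

def sum_out(f, var):
    """Marginalize one variable out of a factor."""
    fv, ft = f
    return tuple(v for v in fv if v != var), ft.sum(axis=fv.index(var))

def clamp(f, var, value):
    """Fix an observed variable to its value (the factor loses that axis)."""
    fv, ft = f
    if var not in fv:
        return f
    return tuple(v for v in fv if v != var), np.take(ft, value, axis=fv.index(var))

def eliminate(factors, order):
    """Sum out the variables in `order`, multiplying only the factors that mention each one."""
    factors = list(factors)
    for var in order:
        bucket = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        tau = bucket[0]
        for f in bucket[1:]:
            tau = multiply(tau, f)
        factors.append(sum_out(tau, var))
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

def random_cpd(rng, scope):
    """Hypothetical CPD p(scope[0] | scope[1:]), normalized over the child axis."""
    t = rng.random((2,) * len(scope))
    return scope, t / t.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
scopes = [("C",), ("D", "C"), ("I",), ("G", "I", "D"), ("S", "I"),
          ("L", "G"), ("J", "L", "S"), ("H", "G", "J")]
factors = [random_cpd(rng, s) for s in scopes]

# Marginal p(J), using the elimination ordering of Table 20.1.
scope_J, p_J = eliminate(factors, ["C", "D", "I", "H", "G", "S", "L"])

# Conditional p(J | I=1, H=0): clamp the evidence, eliminate the rest, normalize.
clamped = [clamp(clamp(f, "I", 1), "H", 0) for f in factors]
_, unnorm = eliminate(clamped, ["C", "D", "G", "S", "L"])
p_J_given = unnorm / unnorm.sum()

# Brute-force reference: build the full joint table and sum it directly.
joint = factors[0]
for f in factors[1:]:
    joint = multiply(joint, f)
jv, jt = joint
brute_J = jt.sum(axis=tuple(k for k, v in enumerate(jv) if v != "J"))
```

The brute-force joint has $2^8$ entries, whereas with this ordering the largest table that `eliminate` builds has $2^4$ entries.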


The generalized distributive law * 


Abstractly, VE can be thought of as computing the following expression:

$$p(\mathbf{x}_q|\mathbf{x}_v) \propto \sum_{\mathbf{x}} \prod_c \psi_c(\mathbf{x}_c) \tag{20.51}$$

It is understood that the visible variables $\mathbf{x}_v$ are clamped, and not summed over. VE uses
non-serial dynamic programming (Bertele and Brioschi 1972), caching intermediate results to 
avoid redundant computation. 

However, there are other tasks we might like to solve for any given graphical model. For 
example, we might want the MAP estimate: 


$$\mathbf{x}^* = \operatorname*{argmax}_{\mathbf{x}} \prod_c \psi_c(\mathbf{x}_c) \tag{20.52}$$


Fortunately, essentially the same algorithm can also be used to solve this task: we just replace 
sum with max. (We also need a traceback step, which actually recovers the argmax, as opposed 
to just the value of max; these details are explained in Section 17.4.4.) 

In general, VE can be applied to any commutative semi-ring. This is a set K, together with 
two binary operations called “+” and “×”, which satisfy the following three axioms:

1. The operation “+” is associative and commutative, and there is an additive identity element
called “0” such that $k + 0 = k$ for all $k \in K$.

2. The operation “×” is associative and commutative, and there is a multiplicative identity
element called “1” such that $k \times 1 = k$ for all $k \in K$.

3. The distributive law holds, i.e.,

$$(a \times b) + (a \times c) = a \times (b + c) \tag{20.53}$$

for all triples $(a, b, c)$ from $K$.


This framework covers an extremely wide range of important applications, including constraint 
satisfaction problems (Bistarelli et al. 1997; Dechter 2003), the fast Fourier transform (Aji and 
McEliece 2000), etc. See Table 20.2 for some examples. 
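
As a small illustration of this idea, the sketch below runs one elimination routine on a chain under three different semirings. The function name `eliminate_chain`, the random potentials, and the choice of a chain are assumptions made for the example; the only thing that changes between the three calls is which NumPy operations play the roles of “+” and “×”.

```python
import itertools
import numpy as np

def eliminate_chain(node_pots, edge_pots, plus, times):
    """Eliminate x_1, ..., x_T in order on a chain, in a given semiring.

    plus  : NumPy reduction playing the role of "+" (np.sum, np.max, np.min)
    times : NumPy ufunc playing the role of "x" (np.multiply, np.add)
    """
    msg = node_pots[0]
    for edge, node in zip(edge_pots, node_pots[1:]):
        # combine the message with the edge factor, "sum" out the old variable,
        # then absorb the local factor of the next node
        msg = times(plus(times(msg[:, None], edge), axis=0), node)
    return plus(msg, axis=0)

rng = np.random.default_rng(0)
T, K = 4, 3
node_pots = [rng.random(K) + 0.1 for _ in range(T)]           # psi_t(x_t)
edge_pots = [rng.random((K, K)) + 0.1 for _ in range(T - 1)]  # psi_{t,t+1}(x_t, x_{t+1})

# sum-product: the partition function Z
Z = eliminate_chain(node_pots, edge_pots, np.sum, np.multiply)
# max-product: the unnormalized probability of the MAP configuration
best = eliminate_chain(node_pots, edge_pots, np.max, np.multiply)
# min-sum: the minimum energy, working with negative log potentials
energy = eliminate_chain([-np.log(p) for p in node_pots],
                         [-np.log(p) for p in edge_pots], np.min, np.add)

def score(x):
    """Unnormalized probability of one joint configuration x."""
    s = np.prod([node_pots[t][x[t]] for t in range(T)])
    return s * np.prod([edge_pots[t][x[t], x[t + 1]] for t in range(T - 1)])

# Brute-force reference over all K**T configurations.
scores = [score(x) for x in itertools.product(range(K), repeat=T)]
```

Recovering the maximizing configuration itself, rather than just its score, would additionally require the traceback step mentioned above.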


Computational complexity of VE 


The running time of VE is clearly exponential in the size of the largest factor, since we have to sum
over all of the corresponding variables. Some of the factors come from the original model (and
are thus unavoidable), but new factors are created in the process of summing out. For example,
in Equation 20.47, we created a factor involving G, I and S; but these nodes were not originally
present together in any factor.


Domain                +                   ×                 Name
$[0, \infty)$         $(+, 0)$            $(\times, 1)$     sum-product
$[0, \infty)$         $(\max, 0)$         $(\times, 1)$     max-product
$(-\infty, \infty]$   $(\min, \infty)$    $(+, 0)$          min-sum
$\{T, F\}$            $(\vee, F)$         $(\wedge, T)$     Boolean satisfiability

Table 20.2 Some commutative semirings.


$$
\begin{aligned}
&\sum_D \sum_C \psi_C(C)\psi_D(D,C) \sum_H \sum_L \sum_S \psi_J(J,L,S) \sum_I \psi_I(I)\psi_S(S,I) \underbrace{\sum_G \psi_G(G,I,D)\psi_L(L,G)\psi_H(H,G,J)}_{\tau_1(I,D,L,J,H)} \\
&\sum_D \sum_C \psi_C(C)\psi_D(D,C) \sum_H \sum_L \sum_S \psi_J(J,L,S) \underbrace{\sum_I \psi_I(I)\psi_S(S,I)\tau_1(I,D,L,J,H)}_{\tau_2(D,L,S,J,H)} \\
&\sum_D \sum_C \psi_C(C)\psi_D(D,C) \sum_H \sum_L \underbrace{\sum_S \psi_J(J,L,S)\tau_2(D,L,S,J,H)}_{\tau_3(D,L,J,H)} \\
&\sum_D \sum_C \psi_C(C)\psi_D(D,C) \sum_H \underbrace{\sum_L \tau_3(D,L,J,H)}_{\tau_4(D,J,H)} \\
&\sum_D \sum_C \psi_C(C)\psi_D(D,C) \underbrace{\sum_H \tau_4(D,J,H)}_{\tau_5(D,J)} \\
&\sum_D \underbrace{\sum_C \psi_C(C)\psi_D(D,C)\tau_5(D,J)}_{\tau_6(D,J)} \\
&\underbrace{\sum_D \tau_6(D,J)}_{\tau_7(J)}
\end{aligned}
$$

Table 20.3 Eliminating variables from Figure 20.3 in the order G, I, S, L, H, C, D.

The order in which we perform the summation is known as the elimination order. This 
can have a large impact on the size of the intermediate factors that are created. For example, 
consider the ordering in Table 20.1: the largest created factor (beyond the original ones in the
model) has size 3, corresponding to $\tau_5(J,L,S)$. Now consider the ordering in Table 20.3: now
the largest factors are $\tau_1(I,D,L,J,H)$ and $\tau_2(D,L,S,J,H)$, which are much bigger.

We can determine the size of the largest factor graphically, without worrying about the actual
numerical values of the factors. When we eliminate a variable $X_t$, we connect together all variables
that share a factor with $X_t$ (to reflect the new temporary factor $\tau'_t$). The edges created by this
process are called fill-in edges. For example, Figure 20.4 shows the fill-in edges introduced
when we eliminate in the order C, D, I, .... The first two steps do not introduce any fill-ins,
but when we eliminate I, we connect G and S, since they co-occur in Equation 20.48.

Figure 20.4 Example of the elimination process, in the order C, D, I, etc. When we eliminate I (figure
c), we add a fill-in edge between G and S, since they are not connected. Based on Figure 9.10 of (Koller
and Friedman 2009).

Let $G(\prec)$ be the (undirected) graph induced by applying variable elimination to $G$ using
elimination ordering $\prec$. The temporary factors generated by VE correspond to maximal cliques
in the graph $G(\prec)$. For example, with ordering (C, D, I, H, G, S, L), the maximal cliques are
as follows: 


{C, D}, {D,I, G}, {G, L, S, J}, {G, J, H}, {G,I, S} (20.54) 


It is clear that the time complexity of VE is

$$\sum_{c \in \mathcal{C}(G(\prec))} K^{|c|} \tag{20.55}$$

where $\mathcal{C}$ are the cliques that are created, $|c|$ is the size of the clique $c$, and we assume for
notational simplicity that all the variables have $K$ states each.

Let us define the induced width of a graph given elimination ordering $\prec$, denoted $w(\prec)$, as
the size of the largest factor (i.e., the largest clique in the induced graph) minus 1. Then it is
easy to see that the complexity of VE with ordering $\prec$ is $O(K^{w(\prec)+1})$.

Obviously we would like to minimize the running time, and hence the induced width. Let us
define the treewidth of a graph as the minimal induced width:

$$w \triangleq \min_{\prec} \max_{c \in G(\prec)} |c| - 1 \tag{20.56}$$

Then clearly the best possible running time for VE is $O(DK^{w+1})$. Unfortunately, one can show
that for arbitrary graphs, finding an elimination ordering $\prec$ that minimizes $w(\prec)$ is NP-hard
(Arnborg et al. 1987). In practice greedy search techniques are used to find reasonable orderings
(Kjaerulff 1990), although people have tried other heuristic methods for discrete optimization,
such as genetic algorithms (Larranaga et al. 1997). It is also possible to derive approximate
algorithms with provable performance guarantees (Amir 2010).
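
The graphical procedure described above is easy to simulate. The following sketch (the function name `induced_width` and the edge-list encoding of the moralized student graph are our own choices for illustration) eliminates nodes in a given order, records the fill-in edges, and reports the induced width.

```python
import itertools

def induced_width(edges, order):
    """Simulate elimination on an undirected graph.

    Returns the induced width of `order` and the list of fill-in edges added.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    width, fill = 0, []
    for x in order:
        nbrs = adj.pop(x)
        # eliminating x creates a factor over x and its current neighbors,
        # i.e., a clique of size len(nbrs) + 1
        width = max(width, len(nbrs))
        for u in nbrs:
            adj[u].discard(x)
        for u, v in itertools.combinations(sorted(nbrs), 2):
            if v not in adj[u]:
                adj[u].add(v)
                adj[v].add(u)
                fill.append((u, v))
    return width, fill

# Moralized student graph of Figure 20.3(b), including the D-I, G-J and L-S arcs.
student = [("C", "D"), ("D", "G"), ("I", "G"), ("D", "I"), ("I", "S"), ("G", "L"),
           ("L", "J"), ("S", "J"), ("L", "S"), ("G", "H"), ("J", "H"), ("G", "J")]

w_good, fill_good = induced_width(student, ["C", "D", "I", "H", "G", "S", "L"])
w_bad, fill_bad = induced_width(student, ["G", "I", "S", "L", "H", "C", "D"])
```

For the ordering of Table 20.1 the only fill-in edge is G-S and the induced width is 3 (from the clique {G, J, L, S}); for the ordering of Table 20.3 the induced width rises to 5.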

In some cases, the optimal elimination ordering is clear. For example, for chains, we should
work forwards or backwards in time. For trees, we should work from the leaves to the root.
These orderings do not introduce any fill-in edges, so $w = 1$. Consequently, inference in chains
and trees takes $O(VK^2)$ time. This is one reason why Markov chains and Markov trees are so
widely used.

Unfortunately, for other graphs, the treewidth is large. For example, for an $m \times n$ 2d lattice,
the treewidth is $O(\min\{m,n\})$ (Lipton and Tarjan 1979). So VE on a $100 \times 100$ Ising model
would take $O(2^{100})$ time.

Of course, just because VE is slow doesn’t mean that there isn’t some smarter algorithm out 
there. We discuss this issue in Section 20.5. 


A weakness of VE 


The main disadvantage of the variable elimination algorithm (apart from its exponential
dependence on treewidth) is that it is inefficient if we want to compute multiple queries conditioned
on the same evidence. For example, consider computing all the marginals in a chain-structured
graphical model such as an HMM. We can easily compute the final marginal $p(x_T|\mathbf{v})$ by
eliminating all the nodes $x_1$ to $x_{T-1}$ in order. This is equivalent to the forwards algorithm, and takes
$O(K^2T)$ time. But now suppose we want to compute $p(x_{T-1}|\mathbf{v})$. We have to run VE again, at
a cost of $O(K^2T)$ time. So the total cost to compute all the marginals is $O(K^2T^2)$. However,
we know that we can solve this problem in $O(K^2T)$ using forwards-backwards. The difference
is that FB caches the messages computed on the forwards pass, so it can reuse them later.
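
The following sketch illustrates the caching idea on a chain-structured MRF. The function name `chain_marginals` and the random potentials are assumptions for the example, and any evidence is taken to be already absorbed into the node potentials. One forwards sweep and one backwards sweep are stored, after which every marginal is obtained from cached messages.

```python
import itertools
import numpy as np

def chain_marginals(node_pots, edge_pots):
    """All single-node marginals of a chain in O(K^2 T), by caching messages."""
    T = len(node_pots)
    fwd = [None] * T    # fwd[t]: message arriving at x_t from the left
    bwd = [None] * T    # bwd[t]: message arriving at x_t from the right
    fwd[0] = np.ones_like(node_pots[0])
    for t in range(1, T):
        fwd[t] = (fwd[t - 1] * node_pots[t - 1]) @ edge_pots[t - 1]
    bwd[T - 1] = np.ones_like(node_pots[-1])
    for t in range(T - 2, -1, -1):
        bwd[t] = edge_pots[t] @ (bwd[t + 1] * node_pots[t + 1])
    beliefs = [fwd[t] * node_pots[t] * bwd[t] for t in range(T)]
    return [b / b.sum() for b in beliefs]

rng = np.random.default_rng(0)
T, K = 5, 3
node_pots = [rng.random(K) + 0.1 for _ in range(T)]           # psi_t(x_t)
edge_pots = [rng.random((K, K)) + 0.1 for _ in range(T - 1)]  # psi_{t,t+1}(x_t, x_{t+1})
margs = chain_marginals(node_pots, edge_pots)

# Brute-force reference: the full joint table with K**T entries.
joint = np.zeros((K,) * T)
for x in itertools.product(range(K), repeat=T):
    p = np.prod([node_pots[t][x[t]] for t in range(T)])
    joint[x] = p * np.prod([edge_pots[t][x[t], x[t + 1]] for t in range(T - 1)])
joint /= joint.sum()
```

For long chains one would normalize the messages at each step to avoid numerical underflow; this is omitted here for brevity.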

The same argument holds for BP on trees. For example, consider the 4-node tree in
Figure 20.5. We can compute $p(x_1|\mathbf{v})$ by eliminating $x_{2:4}$; this is equivalent to sending messages
up to $x_1$ (the messages correspond to the $\tau$ factors created by VE). Similarly we can compute
$p(x_2|\mathbf{v})$, $p(x_3|\mathbf{v})$ and then $p(x_4|\mathbf{v})$. We see that some of the messages used to compute the
marginal on one node can be re-used to compute the marginals on the other nodes. By storing
the messages for later re-use, we can compute all the marginals in $O(DK^2)$ time. This is what
the up-down (collect-distribute) algorithm on trees does.

The question is: how can we combine the efficiency of BP on trees with the generality of VE? 
The answer is given in Section 20.4. 


The junction tree algorithm *


The junction tree algorithm or JTA generalizes BP from trees to arbitrary graphs. We sketch
the basic idea below; for details, see e.g., (Koller and Friedman 2009).


Creating a junction tree 


The basic idea behind the JTA is this. We first run the VE algorithm “symbolically”, adding fill-in
edges as we go, according to a given elimination ordering. The resulting graph will be a chordal
graph, which means that every undirected cycle $X_1 - X_2 \cdots X_k - X_1$ of length $k \geq 4$ has a
chord, i.e., an edge connecting two nodes $X_i$, $X_j$ that are not adjacent in the cycle.

Figure 20.5 Sending multiple messages along a tree. (a) X1 is root. (b) X2 is root. (c) X4 is root. (d) All
of the messages needed to compute all singleton marginals. Based on Figure 4.3 of (Jordan 2007).

Figure 20.6 Left: this graph is not triangulated, despite appearances, since it contains a chordless 5-cycle
1-2-3-4-5-1. Right: one possible triangulation, by adding the 1-3 and 1-4 fill-in edges. Based on (Armstrong
2005, p46).

Having created a chordal graph, we can extract its maximal cliques. In general, finding max 
cliques is computationally hard, but it turns out that it can be done efficiently from this special 
kind of graph. Figure 20.7(b) gives an example, where the max cliques are as follows: 


{C, D}, {G, I, D}, {G, S, I}, {G, J, S, L}, {H, G, J} (20.57)


Note that if the original graphical model was already chordal, the elimination process would not 
add any extra fill-in edges (assuming the optimal elimination ordering was used). We call such 
models decomposable, since they break into little pieces defined by the cliques. 

It turns out that the cliques of a chordal graph can be arranged into a special kind of
tree known as a junction tree. This enjoys the running intersection property (RIP), which
means that the set of tree nodes containing any given variable forms a connected subtree.
Figure 20.7(c) gives an example of such a tree. We see that the node J occurs in two adjacent 
tree nodes, so they can share information about this variable. A similar situation holds for all 
the other variables. 
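
As an illustration, the sketch below builds a junction tree from the maximal cliques in Equation 20.57 and checks the running intersection property. The construction used here, a maximum weight spanning tree of the clique graph with edge weights equal to the separator sizes, is a standard one that is not described in the text above; the function names `junction_tree` and `has_rip` are hypothetical.

```python
import itertools

def junction_tree(cliques):
    """Kruskal-style maximum weight spanning tree on the clique graph.

    The weight of an edge between two cliques is the size of their intersection.
    """
    cand = sorted(((len(a & b), i, j)
                   for (i, a), (j, b) in itertools.combinations(enumerate(cliques), 2)
                   if a & b), reverse=True)
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tree = []
    for _, i, j in cand:
        ri, rj = find(i), find(j)
        if ri != rj:            # adding this edge does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

def has_rip(cliques, tree):
    """Check that, for every variable, the tree nodes containing it are connected."""
    for var in set().union(*cliques):
        nodes = {i for i, c in enumerate(cliques) if var in c}
        adj = {i: set() for i in nodes}
        for i, j in tree:
            if i in nodes and j in nodes:
                adj[i].add(j)
                adj[j].add(i)
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            for n in adj[stack.pop()] - seen:
                seen.add(n)
                stack.append(n)
        if seen != nodes:
            return False
    return True

cliques = [frozenset("CD"), frozenset("GID"), frozenset("GSI"),
           frozenset("GJSL"), frozenset("HGJ")]
tree = junction_tree(cliques)
separators = [cliques[i] & cliques[j] for i, j in tree]
```

On these cliques the result is the chain CD, GID, GSI, GJSL, HGJ with separators D, GI, GS and GJ. A different tree over the same cliques, such as one that links GID directly to GJSL instead of to GSI, violates the property for the variable I.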

One can show that if a tree satisfies the running intersection property, then applying
BP to this tree (as we explain below) will return the exact values of $p(\mathbf{x}_c|\mathbf{v})$ for each node $c$
in the tree (i.e., clique in the induced graph). From this, we can easily extract the node and
edge marginals, $p(x_t|\mathbf{v})$ and $p(x_s, x_t|\mathbf{v})$ from the original model, by marginalizing the clique
distributions.³


Message passing on a junction tree 


Having constructed a junction tree, we can use it for inference. The process is very similar 
to belief propagation on a tree. As in Section 20.2, there are two versions: the sum-product 
form, also known as the Shafer-Shenoy algorithm, named after (Shafer and Shenoy 1990); and 
the belief updating form (which involves division), also known as the Hugin (named after a 
company) or the Lauritzen-Spiegelhalter algorithm (named after (Lauritzen and Spiegelhalter 
1988)). See (Lepar and Shenoy 1998) for a detailed comparison of these methods. Below we 
sketch how the Hugin algorithm works. 
We assume the original model has the following form:

$$p(\mathbf{x}) = \frac{1}{Z} \prod_{c \in \mathcal{C}(G)} \psi_c(\mathbf{x}_c) \tag{20.58}$$

where $\mathcal{C}(G)$ are the cliques of the original graph. On the other hand, the tree defines a
distribution of the following form:

$$p(\mathbf{x}) = \frac{\prod_{c \in \mathcal{C}(T)} \psi_c(\mathbf{x}_c)}{\prod_{s \in \mathcal{S}(T)} \psi_s(\mathbf{x}_s)} \tag{20.59}$$

where $\mathcal{C}(T)$ are the nodes of the junction tree (which are the cliques of the chordal graph), and
$\mathcal{S}(T)$ are the separators of the tree. To make these equal, we initialize by defining $\psi_s = 1$ for
all separators and $\psi_c = 1$ for all cliques. Then, for each clique in the original model, $c \in \mathcal{C}(G)$,
we find a clique in the tree $c' \in \mathcal{C}(T)$ which contains it, $c' \supseteq c$. We then multiply $\psi_c$ onto $\psi_{c'}$
by computing $\psi_{c'} = \psi_{c'} \psi_c$. After doing this for all the cliques in the original graph, we have

$$\prod_{c \in \mathcal{C}(T)} \psi_c(\mathbf{x}_c) = \prod_{c \in \mathcal{C}(G)} \psi_c(\mathbf{x}_c) \tag{20.60}$$


2. The largest loop in a chordal graph is length 3. Consequently chordal graphs are sometimes called triangulated.
However, it is not enough for the graph to look like it is made of little triangles. For example, Figure 20.6(a) is not
chordal, even though it is made of little triangles, since it contains the chordless 5-cycle 1-2-3-4-5-1.

3. If we want the joint distribution of some variables that are not in the same clique (a so-called out-of-clique
query), we can adapt the technique described in Section 20.2.4.3 as follows: create a mega node containing the query
variables and any other nuisance variables that lie on the path between them, multiply in messages onto the boundary
of the mega node, and then marginalize out the internal nuisance variables. This internal marginalization may require
the use of BP or VE. See (Koller and Friedman 2009) for details.


Figure 20.7 (a) The student graph with fill-in edges added. (b) The maximal cliques. (c) The junction
tree. An edge between nodes s and t is labeled by the intersection of the sets on nodes s and t; this is
called the separating set. From Figure 9.11 of (Koller and Friedman 2009). Used with kind permission of
Daphne Koller.


As in Section 20.2.1, we now send messages from the leaves to the root and back, as sketched
in Figure 20.1. In the upwards pass, also known as the collect-to-root phase, node $i$ sends to
its parent $j$ the following message:

$$m_{i \to j}(S_{ij}) = \sum_{C_i \setminus S_{ij}} \psi_i(C_i) \tag{20.61}$$

That is, we marginalize out the variables that node $i$ “knows about” which are irrelevant to $j$,
and then we send what is left over. Once a node has received messages from all its children, it
updates its belief state using

$$\psi_i(C_i) \propto \psi_i(C_i) \prod_{j \in \mathrm{ch}_i} m_{j \to i}(S_{ij}) \tag{20.62}$$

At the root, $\psi_r(C_r)$ represents $p(\mathbf{x}_{C_r}|\mathbf{v})$, which is the posterior over the nodes in clique
$C_r$ conditioned on all the evidence. Its normalization constant is $p(\mathbf{v})/Z_0$, where $Z_0$ is the
normalization constant for the unconditional prior, $p(\mathbf{x})$. (We have $Z_0 = 1$ if the original model
was a DGM.)

In the downwards pass, also known as the distribute-from-root phase, node $i$ sends to its
children $j$ the following message:

$$m_{i \to j}(S_{ij}) = \frac{\sum_{C_i \setminus S_{ij}} \psi_i(C_i)}{m_{j \to i}(S_{ij})} \tag{20.63}$$

We divide out by what $j$ sent to $i$ to avoid double counting the evidence. This requires that we
store the messages from the upwards pass. Once a node has received a top-down message from
its parent, it can compute its final belief state using

$$\psi_j(C_j) \propto \psi_j(C_j)\, m_{i \to j}(S_{ij}) \tag{20.64}$$


An equivalent way to present this algorithm is based on storing the messages inside the
separator potentials. So on the way up, sending from $i$ to $j$ we compute the separator potential

$$\psi^*_{ij}(S_{ij}) = \sum_{C_i \setminus S_{ij}} \psi_i(C_i) \tag{20.65}$$

and then update the recipient potential:

$$\psi^*_j(C_j) \propto \psi_j(C_j) \frac{\psi^*_{ij}(S_{ij})}{\psi_{ij}(S_{ij})} \tag{20.66}$$

(Recall that we initialize $\psi_{ij}(S_{ij}) = 1$.) This is sometimes called passing a flow from $i$ to $j$.
On the way down, from $i$ to $j$, we compute the separator potential

$$\psi^{**}_{ij}(S_{ij}) = \sum_{C_i \setminus S_{ij}} \psi^{**}_i(C_i) \tag{20.67}$$

and then update the recipient potential:

$$\psi^{**}_j(C_j) \propto \psi^*_j(C_j) \frac{\psi^{**}_{ij}(S_{ij})}{\psi^*_{ij}(S_{ij})} \tag{20.68}$$

This process is known as junction tree calibration. See Figure 20.1 for an illustration. Its 
correctness follows from the fact that each edge partitions the evidence into two distinct groups, 
plus the fact that the tree satisfies RIP, which ensures that no information is lost by only 
performing local computations. 
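
The two flows can be traced by hand on the smallest possible junction tree. The sketch below assumes a model on three variables A, B, C with cliques (A, B) and (B, C) and separator B; the variable names, state counts, and random positive potentials are hypothetical, chosen only so that the calibrated potentials can be compared against brute-force marginals.

```python
import numpy as np

rng = np.random.default_rng(0)
KA, KB, KC = 2, 3, 2
psi_ab = rng.random((KA, KB)) + 0.1    # clique potential on (A, B)
psi_bc = rng.random((KB, KC)) + 0.1    # clique potential on (B, C)
sep_b = np.ones(KB)                    # separator potential on B, initialized to 1

# Reference: the unnormalized joint defined by the initial potentials.
joint = psi_ab[:, :, None] * psi_bc[None, :, :]

# Collect: pass a flow from (A, B) up to the root (B, C).
new_sep = psi_ab.sum(axis=0)                   # marginalize A out of the sender
psi_bc = psi_bc * (new_sep / sep_b)[:, None]   # update the recipient
sep_b = new_sep

# Distribute: pass a flow from the root (B, C) back down to (A, B).
new_sep = psi_bc.sum(axis=1)                   # marginalize C out of the sender
psi_ab = psi_ab * (new_sep / sep_b)[None, :]   # dividing by the old separator avoids double counting
sep_b = new_sep
```

After the two flows, each clique potential equals the corresponding (unnormalized) marginal of the joint, the separator equals the marginal on B, and the ratio of clique potentials to the separator potential still reproduces the original joint. The division assumes the separator potential has no zero entries.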


Example: jtree algorithm on a chain 


It is interesting to see what happens if we apply this process to a chain structured graph such as 
an HMM. A detailed discussion can be found in (Smyth et al. 1997), but the basic idea is this. The 
cliques are the edges, and the separators are the nodes, as shown in Figure 20.8. We initialize
the potentials as follows: we set $\psi_s = 1$ for all the separators, we set $\psi_c(x_{t-1}, x_t) = p(x_t|x_{t-1})$
for clique $c = (X_{t-1}, X_t)$, and we set $\psi_c(x_t, y_t) = p(y_t|x_t)$ for clique $c = (X_t, Y_t)$.

Figure 20.8 The junction tree derived from an HMM/SSM of length T = 4.

Next we send messages from left to right. Consider clique $(X_{t-1}, X_t)$ with potential
$p(X_t|X_{t-1})$. It receives a message from clique $(X_{t-2}, X_{t-1})$ via separator $X_{t-1}$ of the form
$\sum_{x_{t-2}} p(X_{t-2}, X_{t-1}|\mathbf{v}_{1:t-1}) = p(X_{t-1}|\mathbf{v}_{1:t-1})$. When combined with the clique potential,
this becomes the two-slice predictive density

$$p(X_t|X_{t-1})\, p(X_{t-1}|\mathbf{v}_{1:t-1}) = p(X_{t-1}, X_t|\mathbf{v}_{1:t-1}) \tag{20.69}$$

The clique $(X_{t-1}, X_t)$ also receives a message from $(X_t, Y_t)$ via separator $X_t$ of the form
$p(y_t|X_t)$, which corresponds to its local evidence. When combined with the updated clique
potential, this becomes the two-slice filtered posterior

$$p(X_{t-1}, X_t|\mathbf{v}_{1:t-1})\, p(v_t|X_t) \propto p(X_{t-1}, X_t|\mathbf{v}_{1:t}) \tag{20.70}$$

Thus the messages in the forwards pass are the filtered belief states $\alpha_t$, and the clique potentials
are the two-slice distributions. In the backwards pass, the messages are the update factors $\gamma_t/\alpha_t$,
where $\gamma_t(k) = p(x_t = k|\mathbf{v}_{1:T})$ and $\alpha_t(k) = p(x_t = k|\mathbf{v}_{1:t})$. By multiplying by this message,
we “swap out” the old $\alpha_t$ message and “swap in” the new $\gamma_t$ message. We see that the backwards
pass involves working with posterior beliefs, not conditional likelihoods. See Section 18.3.2.3 for
further discussion of this difference.


Computational complexity of JTA 


If all nodes are discrete with $K$ states each, it is clear that the JTA takes $O(|\mathcal{C}|K^{w+1})$ time
and space, where $|\mathcal{C}|$ is the number of cliques and $w$ is the treewidth of the graph, i.e., the
size of the largest clique minus 1. Unfortunately, choosing a triangulation so as to minimize the
treewidth is NP-hard, as explained in Section 20.3.2.

The JTA can be modified to handle the case of Gaussian graphical models. The graph-theoretic 
steps remain unchanged. Only the message computation differs. We just need to define how 
to multiply, divide, and marginalize Gaussian potential functions. This is most easily done in 
information form. See e.g., (Lauritzen 1992; Murphy 1998; Cemgil 2001) for the details. The 
algorithm takes $O(|\mathcal{C}|w^3)$ time and $O(|\mathcal{C}|w^2)$ space. When applied to a chain structured graph,
the algorithm is equivalent to the Kalman smoother in Section 18.3.2.

Figure 20.9 Encoding a 3-SAT problem on n variables and m clauses as a DGM. The $Q_s$ variables are
binary random variables. The $C_t$ variables are deterministic functions of the $Q_s$’s, and compute the truth
value of each clause. The $A_t$ nodes are a chain of AND gates, to ensure that the CPT for the final $x$ node
has bounded size. The double rings denote nodes with deterministic CPDs. Source: Figure 9.1 of (Koller
and Friedman 2009). Used with kind permission of Daphne Koller.


JTA generalizations * 


We have seen how to use the JTA algorithm to compute posterior marginals in a graphical model. 
There are several possible generalizations of this algorithm, some of which we mention below. 
All of these exploit graph decomposition in some form or other. They only differ in terms of
how they define/compute messages and “beliefs”. The key requirement is that the operators
which compute messages form a commutative semiring (see Section 20.3.1).

• Computing the MAP estimate. We just replace the sum-product with max-product in the
  collect phase, and use traceback in the distribute phase, as in the Viterbi algorithm
  (Section 17.4.4). See (Dawid 1992) for details.

• Computing the N-most probable configurations (Nilsson 1998).

• Computing posterior samples. The collect pass is the same as usual, but in the distribute
  pass, we sample variables given the values higher up in the tree, thus generalizing
  forwards-filtering backwards-sampling for HMMs described in Section 17.4.5. See (Dawid 1992) for
  details.

• Solving constraint satisfaction problems (Dechter 2003).

• Solving logical reasoning problems (Amir and McIlraith 2005).


Computational intractability of exact inference in the worst case 


As we saw in Sections 20.3.2 and 20.4.3, VE and JTA take time that is exponential in the treewidth 
of a graph. Since the treewidth can be O(number of nodes) in the worst case, this means these 
algorithms can be exponential in the problem size. 

Of course, just because VE and JTA are slow doesn’t mean that there isn’t some smarter
algorithm out there. Unfortunately, this seems unlikely, since it is easy to show that exact inference
is NP-hard (Dagum and Luby 1993). The proof is a simple reduction from the satisfiability
problem. In particular, note that we can encode any 3-SAT problem⁴ as a DGM with deterministic
links, as shown in Figure 20.9. We clamp the final node, $x$, to be on, and we arrange the CPTs
so that $p(x = 1) > 0$ iff there is a satisfying assignment. Computing any posterior marginal
requires evaluating the normalization constant $p(x = 1)$, which represents the probability of the
evidence, so inference in this model implicitly solves the SAT problem.


Method                       Restriction                              Section
Forwards-backwards           Chains, D or LG                          Section 17.4.3
Belief propagation           Trees, D or LG                           Section 20.2
Variable elimination         Low treewidth, D or LG, single query     Section 20.3
Junction tree algorithm      Low treewidth, D or LG                   Section 20.4
Loopy belief propagation     Approximate, D or LG                     Section 22.2
Convex belief propagation    Approximate, D or LG                     Section 22.4.2
Mean field                   Approximate, C-E                         Section 21.3
Gibbs sampling               Approximate                              Section 24.2

Table 20.4 Summary of some methods that can be used for inference in graphical models. “D” means
that all the hidden variables must be discrete. “LG” means that all the factors must be linear-Gaussian.
The term “single query” refers to the restriction that VE only computes one marginal $p(\mathbf{x}_q|\mathbf{x}_v)$ at a time.
See Section 20.3.3 for a discussion of this point. “C-E” stands for “conjugate exponential”; this means that
variational mean field only applies to models where the likelihood is in the exponential family, and the
prior is conjugate. This includes the D and LG case, but many others as well, as we will see in Section 21.5.


In fact, exact inference is #P-hard (Roth 1996), which is even harder than NP-hard. (See e.g., 
(Arora and Barak 2009) for definitions of these terms.) The intuitive reason for this is that to 
compute the normalizing constant Z, we have to count how many satisfying assignments there 
are. By contrast, MAP estimation is provably easier for some model classes (Greig et al. 1989), 
since, intuitively speaking, it only requires finding one satisfying assignment, not counting all of 
them. 
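
The link between inference, satisfiability and counting can be seen on a toy problem. The sketch below assumes that each $Q_s$ has a uniform prior and that $x$ is the deterministic AND of all the clauses, which is one way of arranging the CPTs as described above. Under that assumption, $p(x = 1)$ equals the number of satisfying assignments divided by $2^n$. The clause encoding (signed integers) and the function name `prob_evidence` are hypothetical, and the brute-force enumeration is for illustration only.

```python
import itertools
from fractions import Fraction

def prob_evidence(clauses, n):
    """Return (p(x = 1), number of satisfying assignments) for a CNF formula.

    Each clause is a tuple of nonzero integers: k stands for Q_k and -k for "not Q_k".
    The Q variables are assumed to be independent and uniform on {False, True}.
    """
    n_sat = 0
    for q in itertools.product([False, True], repeat=n):
        # x = 1 iff every clause has at least one true literal
        if all(any(q[abs(lit) - 1] == (lit > 0) for lit in clause) for clause in clauses):
            n_sat += 1
    return Fraction(n_sat, 2 ** n), n_sat

# (Q1 or Q2 or not Q3) and (Q1 or not Q4 or Q5)
p_sat, count = prob_evidence([(1, 2, -3), (1, -4, 5)], n=5)

# Q1 and (not Q1): unsatisfiable, so the evidence x = 1 has probability zero.
p_unsat, _ = prob_evidence([(1, 1, 1), (-1, -1, -1)], n=1)
```

Deciding whether `p_sat` is positive answers the satisfiability question, whereas computing its exact value amounts to counting the satisfying assignments.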


Approximate inference 


Many popular probabilistic models support efficient exact inference, since they are based on 
chains, trees or low treewidth graphs. But there are many other models for which exact 
inference is intractable. In fact, even simple two node models of the form $\theta \rightarrow x$ may not
support exact inference if the prior on $\theta$ is not conjugate to the likelihood $p(x|\theta)$.⁵

Therefore we will need to turn to approximate inference methods. See Table 20.4 for a
summary of coming attractions. For the most part, these methods do not come with any
guarantee as to their accuracy or running time. Theoretical computer scientists would therefore
describe them as heuristics rather than approximation algorithms. In fact, one can prove that
it is not possible to construct polynomial time approximation schemes for inference in general
discrete GMs (Dagum and Luby 1993; Roth 1996). Fortunately, we will see that many of these
heuristic methods perform well in practice.


4. A 3-SAT problem is a logical expression of the form $(Q_1 \vee Q_2 \vee \neg Q_3) \wedge (Q_1 \vee \neg Q_4 \vee Q_5) \cdots$, where the $Q_i$ are
binary variables, and each clause consists of the disjunction of three variables (or their negation). The goal is to find a
satisfying assignment, which is a set of values for the $Q_i$ variables such that the expression evaluates to true.

5. For discrete random variables, conjugacy is not a concern, since discrete distributions are always closed under
conditioning and marginalization. Consequently, graph-theoretic considerations are of more importance when discussing
inference in models with discrete hidden states.


Exercises 


Exercise 20.1 Variable elimination
Consider the MRF in Figure 10.14(b).

a. Suppose we want to compute the partition function using the elimination ordering $\prec = (1, 2, 3, 4, 5, 6)$,
i.e.,

$$\sum_{x_6} \sum_{x_5} \sum_{x_4} \sum_{x_3} \sum_{x_2} \sum_{x_1} \psi_{12}(x_1, x_2)\,\psi_{13}(x_1, x_3)\,\psi_{24}(x_2, x_4)\,\psi_{34}(x_3, x_4)\,\psi_{45}(x_4, x_5)\,\psi_{56}(x_5, x_6) \tag{20.71}$$

If we use the variable elimination algorithm, we will create new intermediate factors. What is the largest
intermediate factor?

b. Add an edge to the original MRF between every pair of variables that end up in the same factor. (These
are called fill in edges.) Draw the resulting MRF. What is the size of the largest maximal clique in this
graph?

c. Now consider elimination ordering $\prec = (4, 1, 2, 3, 5, 6)$, i.e.,

$$\sum_{x_6} \sum_{x_5} \sum_{x_3} \sum_{x_2} \sum_{x_1} \sum_{x_4} \psi_{12}(x_1, x_2)\,\psi_{13}(x_1, x_3)\,\psi_{24}(x_2, x_4)\,\psi_{34}(x_3, x_4)\,\psi_{45}(x_4, x_5)\,\psi_{56}(x_5, x_6) \tag{20.72}$$

If we use the variable elimination algorithm, we will create new intermediate factors. What is the largest
intermediate factor?

d. Add an edge to the original MRF between every pair of variables that end up in the same factor. (These
are called fill in edges.) Draw the resulting MRF. What is the size of the largest maximal clique in this
graph?


Exercise 20.2 Gaussian times Gaussian is Gaussian 


Prove Equation 20.17. Hint: use completing the square. 


Exercise 20.3 Message passing on a tree 


Consider the DGM in Figure 20.10 which represents the following fictitious biological model. Each $G_i$ 
represents the genotype of a person: $G_i = 1$ if they have a healthy gene and $G_i = 2$ if they have an 
unhealthy gene. $G_2$ and $G_3$ may inherit the unhealthy gene from their parent $G_1$. $X_i \in \mathbb{R}$ is a continuous 
measure of blood pressure, which is low if you are healthy and high if you are unhealthy. We define the 
CPDs as follows:

\begin{align}
p(G_1) &= [0.5, 0.5] \tag{20.73} \\
p(G_2|G_1) &= \begin{pmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{pmatrix} \tag{20.74} \\
p(G_3|G_1) &= \begin{pmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{pmatrix} \tag{20.75} \\
p(X_i|G_i = 1) &= \mathcal{N}(X_i|\mu = 50, \sigma^2 = 10) \tag{20.76} \\
p(X_i|G_i = 2) &= \mathcal{N}(X_i|\mu = 60, \sigma^2 = 10) \tag{20.77}
\end{align}

The meaning of the matrix for $p(G_2|G_1)$ is that $p(G_2 = 1|G_1 = 1) = 0.9$, $p(G_2 = 1|G_1 = 2) = 0.1$, 
etc. 

Figure 20.10 A simple DAG representing inherited diseases. 

a. Suppose you observe $X_2 = 50$, and $X_1$ is unobserved. What is the posterior belief on $G_1$, i.e., 
$p(G_1|X_2 = 50)$? 

b. Now suppose you observe $X_2 = 50$ and $X_3 = 50$. What is $p(G_1|X_2, X_3)$? Explain your answer 
intuitively. 

c. Now suppose $X_2 = 60$, $X_3 = 60$. What is $p(G_1|X_2, X_3)$? Explain your answer intuitively. 

d. Now suppose $X_2 = 50$, $X_3 = 60$. What is $p(G_1|X_2, X_3)$? Explain your answer intuitively. 


Exercise 20.4 Inference in 2D lattice MRFs 


Consider an MRF with a 2D $m \times n$ lattice graph structure, so each hidden node, $X_{ij}$, is connected to its 
4 nearest neighbors, as in an Ising model. In addition, each hidden node has its own local evidence, $Y_{ij}$. 
Assume all hidden nodes have K > 2 states. In general, exact inference in such models is intractable, 
because the maximum cliques of the corresponding triangulated graph have size O(max{m,n}). Suppose 
$m \ll n$, i.e., the lattice is short and fat. 

a. How can one efficiently perform exact inference (using a deterministic algorithm) in such models? (By 
exact inference, I mean computing marginal probabilities $P(X_{ij}|\mathbf{y})$ exactly, where $\mathbf{y}$ is all the evidence.) 
Give a brief description of your method. 

b. What is the asymptotic complexity (running time) of your algorithm? 

c. Now suppose the lattice is large and square, so m = n, but all hidden states are binary (i.e., K = 2). In 
this case, how can one efficiently exactly compute (using a deterministic algorithm) the MAP estimate 
$\arg\max_{\mathbf{x}} P(\mathbf{x}|\mathbf{y})$, where $\mathbf{x}$ is the joint assignment to all hidden nodes? 


Variational inference 


Introduction 


We have now seen several algorithms for computing (functions of) a posterior distribution. For 
discrete graphical models, we can use the junction tree algorithm to perform exact inference, 
as explained in Section 20.4. However, this takes time exponential in the treewidth of the 
graph, rendering exact inference often impractical. For the case of Gaussian graphical models, 
exact inference is cubic in the treewidth. However, even this can be too slow if we have many 
variables. In addition, the JTA does not work for continuous random variables outside of the 
Gaussian case, nor for mixed discrete-continuous variables, outside of the conditionally Gaussian 
case. 

For some simple two node graphical models, of the form x → D, we can compute the 
exact posterior p(x|D) in closed form, provided the prior p(x) is conjugate to the likelihood, 
p(D|x) (which means the likelihood must be in the exponential family). See Chapter 5 for some 
examples of this. (Note that in this chapter, x represents the unknown variables, whereas in 
Chapter 5, we used θ to represent the unknowns.) 

In more general settings, we must use approximate inference methods. In Section 8.4.1, we 
discussed the Gaussian approximation, which is useful for inference in two node models of the 
form x → D, where the prior is not conjugate. (For example, Section 8.4.3 applied the method 
to logistic regression.) 

The Gaussian approximation is simple. However, some posteriors are not naturally modelled 
using Gaussians. For example, when inferring multinomial parameters, a Dirichlet distribution is 
a better choice, and when inferring states in a discrete graphical model, a categorical distribution 
is a better choice. 

In this chapter, we will study a more general class of deterministic approximate inference 
algorithms based on variational inference (Jordan et al. 1998; Jaakkola and Jordan 2000; Jaakkola 
2001; Wainwright and Jordan 2008a). The basic idea is to pick an approximation q(x) to the 
distribution from some tractable family, and then to try to make this approximation as close 
as possible to the true posterior, p*(x) = p(x|D). This reduces inference to an optimization 
problem. By relaxing the constraints and/or approximating the objective, we can trade accuracy 
for speed. The bottom line is that variational inference often gives us the speed benefits of MAP 
estimation but the statistical benefits of the Bayesian approach. 
Variational inference 


Suppose p*(x) is our true but intractable distribution and q(x) is some approximation, chosen 
from some tractable family, such as a multivariate Gaussian or a factored distribution. We 
assume q has some free parameters which we want to optimize so as to make q “similar to” p*. 
An obvious cost function to minimize is the KL divergence: 

$$\mathrm{KL}(p^* \| q) = \sum_{\mathbf{x}} p^*(\mathbf{x}) \log \frac{p^*(\mathbf{x})}{q(\mathbf{x})} \tag{21.1}$$

However, this is hard to compute, since taking expectations wrt p* is assumed to be intractable. 
A natural alternative is the reverse KL divergence: 

$$\mathrm{KL}(q \| p^*) = \sum_{\mathbf{x}} q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p^*(\mathbf{x})} \tag{21.2}$$

The main advantage of this objective is that computing expectations wrt q is tractable (by choosing 
a suitable form for q). We discuss the statistical differences between these two objectives in 
Section 21.2.2. 

Unfortunately, Equation 21.2 is still not tractable as written, since even evaluating p*(x) = 
p(x|D) pointwise is hard: it requires evaluating the intractable normalization constant 
Z = p(D). However, usually the unnormalized distribution $\tilde{p}(\mathbf{x}) \triangleq p(\mathbf{x}, \mathcal{D}) = p^*(\mathbf{x}) Z$ is 
tractable to compute. We therefore define our new objective function as follows: 

$$J(q) \triangleq \mathrm{KL}(q \| \tilde{p}) \tag{21.3}$$

where we are slightly abusing notation, since $\tilde{p}$ is not a normalized distribution. Plugging in the 
definition of KL, we get 

\begin{align}
J(q) &= \sum_{\mathbf{x}} q(\mathbf{x}) \log \frac{q(\mathbf{x})}{\tilde{p}(\mathbf{x})} \tag{21.4} \\
&= \sum_{\mathbf{x}} q(\mathbf{x}) \log \frac{q(\mathbf{x})}{Z p^*(\mathbf{x})} \tag{21.5} \\
&= \sum_{\mathbf{x}} q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p^*(\mathbf{x})} - \log Z \tag{21.6} \\
&= \mathrm{KL}(q \| p^*) - \log Z \tag{21.7}
\end{align}

Since Z is a constant, by minimizing J(q), we will force q to become close to p*. 

Since KL divergence is always non-negative, we see that J(q) is an upper bound on the NLL 
(negative log likelihood): 

$$J(q) = \mathrm{KL}(q \| p^*) - \log Z \ge -\log Z = -\log p(\mathcal{D}) \tag{21.8}$$

Alternatively, we can try to maximize the following quantity (in (Koller and Friedman 2009), this 
is referred to as the energy functional), which is a lower bound on the log likelihood of the 
data: 

$$L(q) \triangleq -J(q) = -\mathrm{KL}(q \| p^*) + \log Z \le \log Z = \log p(\mathcal{D}) \tag{21.9}$$


Since this bound is tight when q = p*, we see that variational inference is closely related to EM 
(see Section 11.4.7). 
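
The relationship between J(q), the KL divergence and log Z can be checked numerically on a toy problem. The following is a minimal sketch, assuming a made-up unnormalized distribution over five discrete states (the values of `p_tilde` and the random choice of `q` are illustrative assumptions, not taken from the text). It confirms that J(q) = KL(q||p*) − log Z, that J(q) ≥ −log Z, and that the bound is tight at q = p*.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target p_tilde(x) = p(x, D) over 5 discrete states (made-up values).
p_tilde = rng.uniform(0.1, 2.0, size=5)
Z = p_tilde.sum()          # normalization constant Z = p(D)
p_star = p_tilde / Z       # exact posterior p*(x)


def J(q):
    """Variational objective J(q) = sum_x q(x) log(q(x) / p_tilde(x))."""
    return float(np.sum(q * (np.log(q) - np.log(p_tilde))))


def kl(q, p):
    """KL(q || p) for two normalized discrete distributions with full support."""
    return float(np.sum(q * (np.log(q) - np.log(p))))


q = rng.dirichlet(np.ones(5))   # an arbitrary approximation with full support

gap = J(q) - (kl(q, p_star) - np.log(Z))   # J(q) = KL(q||p*) - log Z, so gap ~ 0
bound_holds = J(q) >= -np.log(Z)           # J(q) upper-bounds the NLL
tight = abs(J(p_star) + np.log(Z))         # the bound is tight when q = p*

print(gap, bound_holds, tight)
```

Because the state space is tiny, Z can be computed by brute force here; in realistic models only `p_tilde` is available, which is exactly why J(q) is the quantity that gets optimized.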
Alternative interpretations of the variational objective 


There are several equivalent ways of writing this objective that provide different insights. One 
formulation is as follows: 

$$J(q) = \mathbb{E}_q[\log q(\mathbf{x})] + \mathbb{E}_q[-\log \tilde{p}(\mathbf{x})] = -\mathbb{H}(q) + \mathbb{E}_q[E(\mathbf{x})] \tag{21.10}$$

which is the expected energy (recall $E(\mathbf{x}) = -\log \tilde{p}(\mathbf{x})$) minus the entropy of the system. In 
statistical physics, J(q) is called the variational free energy or the Helmholtz free energy (see footnote 1). 
Another formulation of the objective is as follows: 

\begin{align}
J(q) &= \mathbb{E}_q[\log q(\mathbf{x}) - \log p(\mathbf{x}) p(\mathcal{D}|\mathbf{x})] \tag{21.11} \\
&= \mathbb{E}_q[\log q(\mathbf{x}) - \log p(\mathbf{x}) - \log p(\mathcal{D}|\mathbf{x})] \tag{21.12} \\
&= \mathbb{E}_q[-\log p(\mathcal{D}|\mathbf{x})] + \mathrm{KL}(q(\mathbf{x}) \| p(\mathbf{x})) \tag{21.13}
\end{align}


This is the expected NLL, plus a penalty term that measures how far the approximate posterior 
is from the exact prior. 

We can also interpret the variational objective from the point of view of information theory 
(the so-called bits-back argument). See (Hinton and Camp 1993; Honkela and Valpola 2004), for 
details. 


Forward or reverse KL? * 


Since the KL divergence is not symmetric in its arguments, minimizing KL (q||p) wrt q will give 
different behavior than minimizing KL (p||q). Below we discuss these two different methods. 

First, consider the reverse KL, KL (q||p), also known as an I-projection or information 
projection. By definition, we have 


$$\mathrm{KL}(q \| p) = \sum_{\mathbf{x}} q(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})} \tag{21.14}$$

This is infinite if p(x) = 0 and q(x) > 0. Thus if p(x) = 0 we must ensure q(x) = 0. We say 
that the reverse KL is zero forcing for q. Hence q will typically under-estimate the support of p. 

Now consider the forwards KL, also known as an M-projection or moment projection: 

$$\mathrm{KL}(p \| q) = \sum_{\mathbf{x}} p(\mathbf{x}) \ln \frac{p(\mathbf{x})}{q(\mathbf{x})} \tag{21.15}$$

This is infinite if q(x) = 0 and p(x) > 0. So if p(x) > 0 we must ensure q(x) > 0. We say 
that the forwards KL is zero avoiding for q. Hence q will typically over-estimate the support of p. 

The difference between these methods is illustrated in Figure 21.1. We see that when the true 
distribution is multimodal, using the forwards KL is a bad idea (assuming q is constrained to 
be unimodal), since the resulting posterior mode/mean will be in a region of low density, right 
between the two peaks. In such contexts, the reverse KL is not only more tractable to compute, 
but also more sensible statistically. 
Figure 21.1 Illustrating forwards vs reverse KL on a bimodal distribution. The blue curves are the contours 
of the true distribution p. The red curves are the contours of the unimodal approximation q. (a) Minimizing 
forwards KL: q tends to “cover” p. (b-c) Minimizing reverse KL: q locks on to one of the two modes. Based 
on Figure 10.3 of (Bishop 2006b). Figure generated by KLfwdReverseMixGauss. 


Figure 21.2 Illustrating forwards vs reverse KL on a symmetric Gaussian. The blue curves are the 
contours of the true distribution p. The red curves are the contours of a factorized approximation q. (a) 
Minimizing KL (q||p). (b) Minimizing KL (p||q). Based on Figure 10.2 of (Bishop 2006b). Figure generated 
by KLpqGauss. 


Another example of the difference is shown in Figure 21.2, where the target distribution is 
an elongated 2d Gaussian and the approximating distribution is a product of two 1d Gaussians. 


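That is, $p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})$, where 

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \boldsymbol{\Lambda} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix} \tag{21.16}$$

In Figure 21.2(a) we show the result of minimizing KL (q||p). In this simple example, one can 
show that the solution has the form 

\begin{align}
q(\mathbf{x}) &= \mathcal{N}(x_1|m_1, \Lambda_{11}^{-1})\, \mathcal{N}(x_2|m_2, \Lambda_{22}^{-1}) \tag{21.17} \\
m_1 &= \mu_1 - \Lambda_{11}^{-1} \Lambda_{12} (m_2 - \mu_2) \tag{21.18} \\
m_2 &= \mu_2 - \Lambda_{22}^{-1} \Lambda_{21} (m_1 - \mu_1) \tag{21.19}
\end{align}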

1. It is called “free” because the variables x are free to vary, rather than being fixed. The variational free energy is a 
function of the distribution q, whereas the regular energy is a function of the state vector x. 
Figure 21.2(a) shows that we have correctly captured the mean, but the approximation is too 
compact: its variance is controlled by the direction of smallest variance of p. In fact, it is 
often the case (although not always (Turner et al. 2008)) that minimizing KL (q||p), where q is 
factorized, results in an approximation that is overconfident. 

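In Figure 21.2(b), we show the result of minimizing KL (p||q). As we show in Exercise 21.7, 
the optimal solution when minimizing the forward KL wrt a factored approximation is to set q 
to be the product of marginals. Thus the solution has the form 

$$q(\mathbf{x}) = \mathcal{N}(x_1|\mu_1, \Sigma_{11})\, \mathcal{N}(x_2|\mu_2, \Sigma_{22}) \tag{21.20}$$

where $\boldsymbol{\Sigma} = \boldsymbol{\Lambda}^{-1}$ is the covariance of p, so that $\Sigma_{ii}$ is the marginal variance of $x_i$. 
Figure 21.2(b) shows that this is too broad, since it is an over-estimate of the support of p. 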
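
The contrast between the two factored solutions can be made concrete with a small numerical sketch. The code below assumes a hypothetical correlated 2d Gaussian target (the values of `mu` and `Sigma` are made up for illustration). It compares the factor variances 1/Λ_ii from the reverse KL solution with the marginal variances of p used by the forward KL solution, and iterates Equations 21.18 and 21.19 to confirm that the means converge to μ.

```python
import numpy as np

# Hypothetical elongated 2d Gaussian target p(x) = N(x | mu, Lambda^{-1}).
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
Lam = np.linalg.inv(Sigma)          # precision matrix Lambda

# Reverse KL, KL(q||p): each factor has variance 1 / Lambda_ii (Equation 21.17).
var_reverse = 1.0 / np.diag(Lam)

# Forward KL, KL(p||q): q is the product of the marginals of p, so each factor
# has the marginal variance, i.e. the corresponding diagonal entry of Sigma.
var_forward = np.diag(Sigma)

# Iterate the coupled mean updates (Equations 21.18-21.19) from an arbitrary start.
m = np.array([2.0, -1.0])
for _ in range(200):
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

print("reverse KL variances:", var_reverse)
print("forward KL variances:", var_forward)
print("converged means:", m)
```

With a correlation of 0.9, the reverse KL variances are about five times smaller than the true marginal variances, which is the overconfidence described in the text.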

For the rest of this chapter, and for most of the next, we will focus on minimizing KL (q||p). 
In Section 22.5, when we discuss expectation propagation, we will discuss ways to locally optimize 
KL (p||q). 

One can create a family of divergence measures indexed by a parameter $\alpha \in \mathbb{R}$ by defining 
the alpha divergence as follows: 

$$D_\alpha(p \| q) \triangleq \frac{4}{1 - \alpha^2} \left( 1 - \int p(x)^{(1+\alpha)/2}\, q(x)^{(1-\alpha)/2}\, dx \right) \tag{21.21}$$

This measure satisfies $D_\alpha(p \| q) = 0$ iff p = q, but is obviously not symmetric, and hence is 
not a metric. KL (p||q) corresponds to the limit $\alpha \to 1$, whereas KL (q||p) corresponds to the 
limit $\alpha \to -1$. When $\alpha = 0$, we get a symmetric divergence measure that is linearly related to 
the Hellinger distance, defined by 

$$D_H(p \| q) \triangleq \int \left( p(x)^{1/2} - q(x)^{1/2} \right)^2 dx \tag{21.22}$$

Note that $\sqrt{D_H(p \| q)}$ is a valid distance metric, that is, it is symmetric, non-negative and 
satisfies the triangle inequality. See (Minka 2005) for details. 


The mean field method 


One of the most popular forms of variational inference is called the mean field approximation 
(Opper and Saad 2001). In this approach, we assume the posterior is a fully factorized 
approximation of the form 

$$q(\mathbf{x}) = \prod_i q_i(\mathbf{x}_i) \tag{21.23}$$

Our goal is to solve this optimization problem: 

$$\min_{q_1, \ldots, q_D} \mathrm{KL}(q \| p) \tag{21.24}$$

where we optimize over the parameters of each marginal distribution $q_i$. In Section 21.3.1, we 
derive a coordinate descent method, where at each step we make the following update: 

$$\log q_j(\mathbf{x}_j) = \mathbb{E}_{-q_j}[\log \tilde{p}(\mathbf{x})] + \text{const} \tag{21.25}$$
Model                         Section 
Ising model                   Section 21.3.2 
Factorial HMM                 Section 21.4.1 
Univariate Gaussian           Section 21.5.1 
Linear regression             Section 21.5.2 
Logistic regression           Section 21.8.1.1 
Mixtures of Gaussians         Section 21.6.1 
Latent Dirichlet allocation   Section 27.3.6.3 

Table 21.1 Some models in this book for which we provide detailed derivations of the mean field inference 
algorithm. 


where $\tilde{p}(\mathbf{x}) = p(\mathbf{x}, \mathcal{D})$ is the unnormalized posterior and the notation $\mathbb{E}_{-q_j}[f(\mathbf{x})]$ means to 
take the expectation over f(x) with respect to all the variables except for $x_j$. For example, if 
we have three variables, then 

$$\mathbb{E}_{-q_2}[f(\mathbf{x})] = \sum_{x_1} \sum_{x_3} q_1(x_1)\, q_3(x_3)\, f(x_1, x_2, x_3) \tag{21.26}$$

where sums get replaced by integrals where necessary. 

When updating $q_j$, we only need to reason about the variables which share a factor with $x_j$, 
i.e., the terms in j’s Markov blanket (see Section 10.5.3); the other terms get absorbed into the 
constant term. Since we are replacing the neighboring values by their mean value, the method 
is known as mean field. This is very similar to Gibbs sampling (Section 24.2), except instead 
of sending sampled values between neighboring nodes, we send mean values between nodes. 
This tends to be more efficient, since the mean can be used as a proxy for a large number of 
samples. (On the other hand, mean field messages are dense, whereas samples are sparse; this 
can make sampling more scalable to very large models.) 

Of course, updating one distribution at a time can be slow, since it is a form of coordinate 
descent. Several methods have been proposed to speed up this basic approach, including using 
pattern search (Honkela et al. 2003), and techniques based on parameter expansion (Qi and 
Jaakkola 2008). However, we will not consider these methods in this chapter. 

It is important to note that the mean field method can be used to infer discrete or continuous 
latent quantities, using a variety of parametric forms for q_i, as we will see below. This is 
in contrast to some of the other variational methods we will encounter later, which are more 
restricted in their applicability. Table 21.1 lists some of the examples of mean field that we cover 
in this book. 


Derivation of the mean field update equations 


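Recall that the goal of variational inference is to minimize the upper bound $J(q) \ge -\log p(\mathcal{D})$. 
Equivalently, we can try to maximize the lower bound 

$$L(q) \triangleq -J(q) = \sum_{\mathbf{x}} q(\mathbf{x}) \log \frac{\tilde{p}(\mathbf{x})}{q(\mathbf{x})} \le \log p(\mathcal{D}) \tag{21.27}$$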


We will do this one term at a time. 
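If we write the objective singling out the terms that involve $q_j$, and regarding all the other 
terms as constants, we get 

\begin{align}
L(q_j) &= \sum_{\mathbf{x}} \prod_i q_i(\mathbf{x}_i) \Big[ \log \tilde{p}(\mathbf{x}) - \sum_k \log q_k(\mathbf{x}_k) \Big] \tag{21.28} \\
&= \sum_{\mathbf{x}_j} \sum_{\mathbf{x}_{-j}} q_j(\mathbf{x}_j) \prod_{i \ne j} q_i(\mathbf{x}_i) \Big[ \log \tilde{p}(\mathbf{x}) - \sum_k \log q_k(\mathbf{x}_k) \Big] \tag{21.29} \\
&= \sum_{\mathbf{x}_j} q_j(\mathbf{x}_j) \sum_{\mathbf{x}_{-j}} \prod_{i \ne j} q_i(\mathbf{x}_i) \log \tilde{p}(\mathbf{x}) \notag \\
&\quad - \sum_{\mathbf{x}_j} q_j(\mathbf{x}_j) \sum_{\mathbf{x}_{-j}} \prod_{i \ne j} q_i(\mathbf{x}_i) \Big[ \sum_{k \ne j} \log q_k(\mathbf{x}_k) + \log q_j(\mathbf{x}_j) \Big] \tag{21.30} \\
&= \sum_{\mathbf{x}_j} q_j(\mathbf{x}_j) \log f_j(\mathbf{x}_j) - \sum_{\mathbf{x}_j} q_j(\mathbf{x}_j) \log q_j(\mathbf{x}_j) + \text{const} \tag{21.31}
\end{align}

where 

$$\log f_j(\mathbf{x}_j) \triangleq \sum_{\mathbf{x}_{-j}} \prod_{i \ne j} q_i(\mathbf{x}_i) \log \tilde{p}(\mathbf{x}) = \mathbb{E}_{-q_j}[\log \tilde{p}(\mathbf{x})] \tag{21.32}$$

So we average out all the hidden variables except for $\mathbf{x}_j$. Thus we can rewrite $L(q_j)$ as follows: 

$$L(q_j) = -\mathrm{KL}(q_j \| f_j) + \text{const} \tag{21.33}$$

We can maximize L by minimizing this KL, which we can do by setting $q_j = f_j$, as follows: 

$$q_j(\mathbf{x}_j) = \frac{1}{Z_j} \exp\left( \mathbb{E}_{-q_j}[\log \tilde{p}(\mathbf{x})] \right) \tag{21.34}$$

We can usually ignore the local normalization constant $Z_j$, since we know $q_j$ must be a 
normalized distribution. Hence we usually work with the form 

$$\log q_j(\mathbf{x}_j) = \mathbb{E}_{-q_j}[\log \tilde{p}(\mathbf{x})] + \text{const} \tag{21.35}$$

The functional form of the $q_j$ distributions will be determined by the type of variables $\mathbf{x}_j$, as 
well as the form of the model. (This is sometimes called free-form optimization.) If $\mathbf{x}_j$ is a 
discrete random variable, then $q_j$ will be a discrete distribution; if $\mathbf{x}_j$ is a continuous random 
variable, then $q_j$ will be some kind of pdf. We will see examples of this below. 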
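
To see the update of Equation 21.35 in action, here is a minimal sketch for two discrete variables. The joint table `log_p` is a made-up example, and the helper names `normalize` and `elbo` are illustrative, not from the text. Each factor is set to the normalized exponential of the expected log joint under the other factor, and the lower bound L(q) is tracked to check that it never decreases and never exceeds log Z.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unnormalized joint p_tilde(x1, x2) on a 3 x 4 grid of states (made-up values).
log_p = np.log(rng.uniform(0.05, 1.0, size=(3, 4)))


def normalize(log_f):
    """Return exp(log_f) / Z_j, computed stably."""
    f = np.exp(log_f - log_f.max())
    return f / f.sum()


def elbo(q1, q2):
    """Lower bound L(q) = sum_x q(x) log(p_tilde(x) / q(x)) for q = q1 * q2."""
    q = np.outer(q1, q2)
    return float(np.sum(q * (log_p - np.log(q))))


q1 = np.full(3, 1 / 3)
q2 = np.full(4, 1 / 4)
history = [elbo(q1, q2)]
for _ in range(20):
    q1 = normalize(log_p @ q2)   # log q1(x1) = E_{q2}[log p_tilde(x1, x2)] + const
    history.append(elbo(q1, q2))
    q2 = normalize(q1 @ log_p)   # log q2(x2) = E_{q1}[log p_tilde(x1, x2)] + const
    history.append(elbo(q1, q2))

log_Z = np.log(np.exp(log_p).sum())
print(history[0], history[-1], log_Z)
```

The gap between the final bound and `log_Z` equals KL(q||p*), which is generally nonzero because the true posterior over (x1, x2) need not factorize.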


Example: mean field for the Ising model 


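Consider the image denoising example from Section 19.4.1, where $x_i \in \{-1, +1\}$ are the hidden 
pixel values of the “clean” image. We have a joint model of the form 

$$p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x})\, p(\mathbf{y}|\mathbf{x}) \tag{21.36}$$

where the prior has the form 

\begin{align}
p(\mathbf{x}) &= \frac{1}{Z_0} \exp(-E_0(\mathbf{x})) \tag{21.37} \\
E_0(\mathbf{x}) &= -\sum_{i=1}^{D} \sum_{j \in \mathrm{nbr}_i} W_{ij} x_i x_j \tag{21.38}
\end{align}

and the likelihood has the form 

$$p(\mathbf{y}|\mathbf{x}) = \prod_i p(y_i|x_i) = \exp\Big( \sum_i L_i(x_i) \Big) \tag{21.39}$$

where $L_i(x_i) \triangleq \log p(y_i|x_i)$ is the local log-likelihood. Therefore the posterior has the form 

\begin{align}
p(\mathbf{x}|\mathbf{y}) &= \frac{1}{Z} \exp(-E(\mathbf{x})) \tag{21.40} \\
E(\mathbf{x}) &= E_0(\mathbf{x}) - \sum_i L_i(x_i) \tag{21.41}
\end{align}

We will now approximate this by a fully factored approximation 

$$q(\mathbf{x}) = \prod_i q(x_i, \mu_i) \tag{21.42}$$

where $\mu_i$ is the mean value of node i. To derive the update for the variational parameter $\mu_i$, we 
first write out $\log \tilde{p}(\mathbf{x}) = -E(\mathbf{x})$, dropping terms that do not involve $x_i$: 

$$\log \tilde{p}(\mathbf{x}) = x_i \sum_{j \in \mathrm{nbr}_i} W_{ij} x_j + L_i(x_i) + \text{const} \tag{21.43}$$

This only depends on the states of the neighboring nodes. Now we take expectations of this wrt 
$\prod_{j \ne i} q_j(x_j)$ to get 

$$q_i(x_i) \propto \exp\Big( x_i \sum_{j \in \mathrm{nbr}_i} W_{ij} \mu_j + L_i(x_i) \Big) \tag{21.44}$$

Thus we replace the states of the neighbors by their average values. Let 

$$m_i = \sum_{j \in \mathrm{nbr}_i} W_{ij} \mu_j \tag{21.45}$$

be the mean field influence on node i. Also, let $L_i^+ \triangleq L_i(+1)$ and $L_i^- \triangleq L_i(-1)$. The 
approximate marginal posterior is given by 

\begin{align}
q_i(x_i = 1) &= \frac{e^{m_i + L_i^+}}{e^{m_i + L_i^+} + e^{-m_i + L_i^-}} = \frac{1}{1 + e^{-2 m_i + L_i^- - L_i^+}} = \mathrm{sigm}(2 a_i) \tag{21.46} \\
a_i &\triangleq m_i + 0.5 (L_i^+ - L_i^-) \tag{21.47}
\end{align}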


Figure 21.3 Example of image denoising using mean field (with parallel updates and a damping factor 
of 0.5). We use an Ising prior with $W_{ij} = 1$ and a Gaussian noise model with $\sigma = 2$. We show 
the results after 1, 3 and 15 iterations across the image. Compare to Figure 24.1. Figure generated by 
isingImageDenoiseDemo. 


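Similarly, we have $q_i(x_i = -1) = \mathrm{sigm}(-2 a_i)$. From this we can compute the new mean for 
site i: 

\begin{align}
\mu_i &= \mathbb{E}_{q_i}[x_i] = q_i(x_i = +1) \cdot (+1) + q_i(x_i = -1) \cdot (-1) \tag{21.48} \\
&= \frac{1}{1 + e^{-2 a_i}} - \frac{1}{1 + e^{2 a_i}} = \frac{e^{a_i}}{e^{a_i} + e^{-a_i}} - \frac{e^{-a_i}}{e^{-a_i} + e^{a_i}} = \tanh(a_i) \tag{21.49}
\end{align}

Hence the update equation becomes 

$$\mu_i = \tanh\Big( \sum_{j \in \mathrm{nbr}_i} W_{ij} \mu_j + 0.5 (L_i^+ - L_i^-) \Big) \tag{21.50}$$

See also Exercise 21.6 for an alternative derivation of these equations. 
We can turn the above equations into a fixed point algorithm by writing 

$$\mu_i^t = \tanh\Big( \sum_{j \in \mathrm{nbr}_i} W_{ij} \mu_j^{t-1} + 0.5 (L_i^+ - L_i^-) \Big) \tag{21.51}$$

It is usually better to use damped updates of the form 

$$\mu_i^t = (1 - \lambda) \mu_i^{t-1} + \lambda \tanh\Big( \sum_{j \in \mathrm{nbr}_i} W_{ij} \mu_j^{t-1} + 0.5 (L_i^+ - L_i^-) \Big) \tag{21.52}$$

for $0 < \lambda < 1$. We can update all the nodes in parallel, or update them asynchronously. 

Figure 21.3 shows the method in action, applied to a 2d Ising model with homogeneous 
attractive potentials, $W_{ij} = 1$. We use parallel updates with a damping factor of $\lambda = 0.5$. (If we 
don’t use damping, we tend to get “checkerboard” artefacts.) 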
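
The damped parallel update can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the isingImageDenoiseDemo code referred to in the caption: the function name `mean_field_ising`, the synthetic test image, and the choice to take $L_i(x_i)$ as the Gaussian log-likelihood $\log \mathcal{N}(y_i|x_i, \sigma^2)$ are all assumptions made for this example.

```python
import numpy as np


def mean_field_ising(y, W=1.0, sigma=2.0, lam=0.5, n_iters=15):
    """Damped parallel mean field updates on a 2d lattice with 4 neighbors.

    y is the noisy image, modeled as y_i ~ N(x_i, sigma^2) with x_i in {-1, +1}.
    Returns mu, the approximate posterior mean of each pixel.
    """
    # L_i(x_i) = log N(y_i | x_i, sigma^2); shared constants cancel in L+ - L-.
    L_plus = -0.5 * (y - 1.0) ** 2 / sigma ** 2
    L_minus = -0.5 * (y + 1.0) ** 2 / sigma ** 2
    bias = 0.5 * (L_plus - L_minus)

    mu = np.tanh(bias)                       # start from the local evidence alone
    for _ in range(n_iters):
        padded = np.pad(mu, 1)               # zero padding: no neighbors off the edge
        nbr_sum = (padded[:-2, 1:-1] + padded[2:, 1:-1]
                   + padded[1:-1, :-2] + padded[1:-1, 2:])
        mu = (1 - lam) * mu + lam * np.tanh(W * nbr_sum + bias)
    return mu


rng = np.random.default_rng(0)
clean = -np.ones((32, 32))
clean[8:24, 8:24] = 1.0                      # a white square on a black background
noisy = clean + 2.0 * rng.standard_normal(clean.shape)

mu = mean_field_ising(noisy)
err_noisy = np.mean(np.sign(noisy) != clean)  # error of thresholding the raw image
err_mf = np.mean(np.sign(mu) != clean)        # error of thresholding the mean field
print(err_noisy, err_mf)
```

Thresholding `mu` at zero gives a denoised image; the neighbor sum is computed for all pixels at once, which corresponds to the parallel update schedule described above.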


Structured mean field * 


Assuming that all the variables are independent in the posterior is a very strong assumption that 
can lead to poor results. Sometimes we can exploit tractable substructure in our problem, so 
that we can efficiently handle some kinds of dependencies. This is called the structured mean 
field approach (Saul and Jordan 1995). The approach is the same as before, except we group sets 
of variables together, and we update them simultaneously. (This follows by simply treating all 
the variables in the i'th group as a single “mega-variable”, and then repeating the derivation in 
Section 21.3.1.) As long as we can perform efficient inference in each $q_i$, the method is tractable 
overall. We give an example below. See (Bouchard-Cote and Jordan 2009) for some more recent 
work in this area. 

Figure 21.4 (a) A factorial HMM with 3 chains. (b) A fully factorized approximation. (c) A product-of- 
chains approximation. Based on Figure 2 of (Ghahramani and Jordan 1997). 


Example: factorial HMM 


Consider the factorial HMM model (Ghahramani and Jordan 1997) introduced in Section 17.6.5. 
Suppose there are M chains, each of length T, and suppose each hidden node has K states. 
The model is defined as follows 


$$p(\mathbf{x}, \mathbf{y}) = \prod_m \prod_t p(x_{tm}|x_{t-1,m})\, p(\mathbf{y}_t|x_{tm}) \tag{21.53}$$

where $p(x_{tm} = k|x_{t-1,m} = j) = A_{mjk}$ is an entry in the transition matrix for chain m, 
$p(x_{1m} = k|x_{0m}) = p(x_{1m} = k) = \pi_{mk}$ is the initial state distribution for chain m, and 

$$p(\mathbf{y}_t|\mathbf{x}_t) = \mathcal{N}\Big( \mathbf{y}_t \,\Big|\, \sum_{m=1}^{M} \mathbf{W}_m \mathbf{x}_{tm}, \boldsymbol{\Sigma} \Big) \tag{21.54}$$

is the observation model, where $\mathbf{x}_{tm}$ is a 1-of-K encoding of $x_{tm}$ and $\mathbf{W}_m$ is a $D \times K$ 
matrix (assuming $\mathbf{y}_t \in \mathbb{R}^D$). Figure 21.4(a) illustrates the model for the case where M = 3. 
Even though each chain is a priori independent, they become coupled in the posterior due 
to having an observed common child, $\mathbf{y}_t$. The junction tree algorithm applied to this graph 
takes $O(T M K^{M+1})$ time. Below we will derive a structured mean field algorithm that takes 
$O(T M K^2 I)$ time, where I is the number of mean field iterations (typically $I \sim 10$ suffices for 
good performance). 
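We can write the exact posterior in the following form: 

\begin{align}
p(\mathbf{x}|\mathbf{y}) &= \frac{1}{Z} \exp(-E(\mathbf{x}, \mathbf{y})) \tag{21.55} \\
E(\mathbf{x}, \mathbf{y}) &= \frac{1}{2} \sum_{t=1}^{T} \Big( \mathbf{y}_t - \sum_m \mathbf{W}_m \mathbf{x}_{tm} \Big)^T \boldsymbol{\Sigma}^{-1} \Big( \mathbf{y}_t - \sum_m \mathbf{W}_m \mathbf{x}_{tm} \Big) \notag \\
&\quad - \sum_m \mathbf{x}_{1m}^T \tilde{\boldsymbol{\pi}}_m - \sum_{t=2}^{T} \sum_m \mathbf{x}_{tm}^T \tilde{\mathbf{A}}_m \mathbf{x}_{t-1,m} \tag{21.56}
\end{align}

where $\tilde{\mathbf{A}}_m \triangleq \log \mathbf{A}_m$ and $\tilde{\boldsymbol{\pi}}_m \triangleq \log \boldsymbol{\pi}_m$ (both interpreted elementwise). 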

We can approximate the posterior as a product of marginals, as in Figure 21.4(b), but a better 
approximation is to use a product of chains, as in Figure 21.4(c). Each chain can be tractably 
updated individually, using the forwards-backwards algorithm. More precisely, we assume 


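\begin{align}
q(\mathbf{x}|\mathbf{y}) &= \frac{1}{Z_q} \prod_{m=1}^{M} q(x_{1m}|\boldsymbol{\xi}_{1m}) \prod_{t=2}^{T} q(x_{tm}|x_{t-1,m}, \boldsymbol{\xi}_{tm}) \tag{21.57} \\
q(x_{1m}|\boldsymbol{\xi}_{1m}) &= \prod_{k=1}^{K} (\xi_{1mk} \pi_{mk})^{x_{1mk}} \tag{21.58} \\
q(x_{tm}|x_{t-1,m}, \boldsymbol{\xi}_{tm}) &= \prod_{k=1}^{K} \Big( \xi_{tmk} \prod_{j=1}^{K} (A_{mjk})^{x_{t-1,m,j}} \Big)^{x_{tmk}} \tag{21.59}
\end{align}

We see that the $\xi_{tmk}$ parameters play the role of an approximate local evidence, averaging out 
the effects of the other chains. This is in contrast to the exact local evidence, which couples all 
the chains together. 

We can rewrite the approximate posterior as $q(\mathbf{x}) = \frac{1}{Z_q} \exp(-E_q(\mathbf{x}))$, where 

$$E_q(\mathbf{x}) = -\sum_{t=1}^{T} \sum_{m=1}^{M} \mathbf{x}_{tm}^T \tilde{\boldsymbol{\xi}}_{tm} - \sum_{m=1}^{M} \mathbf{x}_{1m}^T \tilde{\boldsymbol{\pi}}_m - \sum_{t=2}^{T} \sum_{m=1}^{M} \mathbf{x}_{tm}^T \tilde{\mathbf{A}}_m \mathbf{x}_{t-1,m} \tag{21.60}$$

where $\tilde{\boldsymbol{\xi}}_{tm} = \log \boldsymbol{\xi}_{tm}$. We see that this has the same temporal factors as the exact posterior, 
but the local evidence term is different. The objective function is given by 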


$$\mathrm{KL}(q \| p) = \mathbb{E}[E] - \mathbb{E}[E_q] - \log Z_q + \log Z \tag{21.61}$$

where the expectations are taken wrt q. One can show (Exercise 21.8) that the update has the 
form 

\begin{align}
\boldsymbol{\xi}_{tm} &= \exp\Big( \mathbf{W}_m^T \boldsymbol{\Sigma}^{-1} \tilde{\mathbf{y}}_{tm} - \frac{1}{2} \boldsymbol{\delta}_m \Big) \tag{21.62} \\
\boldsymbol{\delta}_m &\triangleq \mathrm{diag}(\mathbf{W}_m^T \boldsymbol{\Sigma}^{-1} \mathbf{W}_m) \tag{21.63} \\
\tilde{\mathbf{y}}_{tm} &\triangleq \mathbf{y}_t - \sum_{\ell \ne m}^{M} \mathbf{W}_\ell\, \mathbb{E}[\mathbf{x}_{t,\ell}] \tag{21.64}
\end{align}

The $\boldsymbol{\xi}_{tm}$ parameter plays the role of the local evidence, averaging over the neighboring chains. 
Having computed this for each chain, we can perform forwards-backwards in parallel, using 
these approximate local evidence terms to compute $q(\mathbf{x}_{t,m}|\mathbf{y}_{1:T})$ for each m and t. 

The update cost is $O(T M K^2)$ for a full “sweep” over all the variational parameters, since we 
have to run forwards-backwards M times, for each chain independently. This is the same cost 
as a fully factorized approximation, but is much more accurate. 


Variational Bayes 


So far we have been concentrating on inferring latent variables $\mathbf{z}_i$ assuming the parameters $\boldsymbol{\theta}$ 
of the model are known. Now suppose we want to infer the parameters themselves. If we 
make a fully factorized (i.e., mean field) approximation, $p(\boldsymbol{\theta}|\mathcal{D}) \approx \prod_k q(\theta_k)$, we get a method 
known as variational Bayes or VB (Hinton and Camp 1993; MacKay 1995a; Attias 2000; Beal 
and Ghahramani 2006; Smidl and Quinn 2005) (see footnote 2). We give some examples of VB below, assuming 
that there are no latent variables. If we want to infer both latent variables and parameters, and 
we make an approximation of the form $p(\boldsymbol{\theta}, \mathbf{z}_{1:N}|\mathcal{D}) \approx q(\boldsymbol{\theta}) \prod_i q_i(\mathbf{z}_i)$, we get a method known 
as variational Bayes EM, which we describe in Section 21.6. 


Example: VB for a univariate Gaussian 


Following (MacKay 2003, p429), let us consider how to apply VB to infer the posterior over the 
parameters for a 1d Gaussian, $p(\mu, \lambda|\mathcal{D})$, where $\lambda = 1/\sigma^2$ is the precision. For convenience, we 
will use a conjugate prior of the form 

$$p(\mu, \lambda) = \mathcal{N}(\mu|\mu_0, (\kappa_0 \lambda)^{-1})\, \mathrm{Ga}(\lambda|a_0, b_0) \tag{21.65}$$

However, we will use an approximate factored posterior of the form 

$$q(\mu, \lambda) = q_\mu(\mu)\, q_\lambda(\lambda) \tag{21.66}$$

We do not need to specify the forms for the distributions $q_\mu$ and $q_\lambda$; the optimal forms will “fall 
out” automatically during the derivation (and conveniently, they turn out to be Gaussian and 
Gamma respectively). 

You might wonder why we would want to do this, since we know how to compute the 
exact posterior for this model (Section 4.6.3.7). There are two reasons. First, it is a useful 
pedagogical exercise, since we can compare the quality of our approximation to the exact 
posterior. Second, it is simple to modify the method to handle a semi-conjugate prior of the 
form $p(\mu, \lambda) = \mathcal{N}(\mu|\mu_0, \tau_0)\, \mathrm{Ga}(\lambda|a_0, b_0)$, for which exact inference is no longer possible. 


2. This method was originally called ensemble learning (MacKay 1995a), since we are using an ensemble of parameters 
(a distribution) instead of a point estimate. However, the term “ensemble learning” is also used to describe methods 
such as boosting, so we prefer the term VB. 
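Target distribution 


The unnormalized log posterior has the form 

\begin{align}
\log \tilde{p}(\mu, \lambda) &= \log p(\mu, \lambda, \mathcal{D}) = \log p(\mathcal{D}|\mu, \lambda) + \log p(\mu|\lambda) + \log p(\lambda) \tag{21.67} \\
&= \frac{N}{2} \log \lambda - \frac{\lambda}{2} \sum_{i=1}^{N} (x_i - \mu)^2 - \frac{\kappa_0 \lambda}{2} (\mu - \mu_0)^2 \notag \\
&\quad + \frac{1}{2} \log(\kappa_0 \lambda) + (a_0 - 1) \log \lambda - b_0 \lambda + \text{const} \tag{21.68}
\end{align}


Updating $q_\mu(\mu)$ 


The optimal form for $q_\mu(\mu)$ is obtained by averaging over $\lambda$: 

\begin{align}
\log q_\mu(\mu) &= \mathbb{E}_{q_\lambda}[\log p(\mathcal{D}|\mu, \lambda) + \log p(\mu|\lambda)] + \text{const} \tag{21.69} \\
&= -\frac{\mathbb{E}_{q_\lambda}[\lambda]}{2} \Big\{ \kappa_0 (\mu - \mu_0)^2 + \sum_{i=1}^{N} (x_i - \mu)^2 \Big\} + \text{const} \tag{21.70}
\end{align}

By completing the square one can show that $q_\mu(\mu) = \mathcal{N}(\mu|\mu_N, \kappa_N^{-1})$, where 

$$\mu_N = \frac{\kappa_0 \mu_0 + N \bar{x}}{\kappa_0 + N}, \quad \kappa_N = (\kappa_0 + N)\, \mathbb{E}_{q_\lambda}[\lambda] \tag{21.71}$$

At this stage we don’t know what $q_\lambda(\lambda)$ is, and hence we cannot compute $\mathbb{E}[\lambda]$, but we will 
derive this below. 


Updating $q_\lambda(\lambda)$ 


The optimal form for $q_\lambda(\lambda)$ is given by 

\begin{align}
\log q_\lambda(\lambda) &= \mathbb{E}_{q_\mu}[\log p(\mathcal{D}|\mu, \lambda) + \log p(\mu|\lambda) + \log p(\lambda)] + \text{const} \tag{21.72} \\
&= (a_0 - 1) \log \lambda - b_0 \lambda + \frac{1}{2} \log \lambda + \frac{N}{2} \log \lambda \notag \\
&\quad - \frac{\lambda}{2} \mathbb{E}_{q_\mu}\Big[ \kappa_0 (\mu - \mu_0)^2 + \sum_{i=1}^{N} (x_i - \mu)^2 \Big] + \text{const} \tag{21.73}
\end{align}

We recognize this as the log of a Gamma distribution, hence $q_\lambda(\lambda) = \mathrm{Ga}(\lambda|a_N, b_N)$, where 

\begin{align}
a_N &= a_0 + \frac{N + 1}{2} \tag{21.74} \\
b_N &= b_0 + \frac{1}{2} \mathbb{E}_{q_\mu}\Big[ \kappa_0 (\mu - \mu_0)^2 + \sum_{i=1}^{N} (x_i - \mu)^2 \Big] \tag{21.75}
\end{align}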
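Computing the expectations 


To implement the updates, we have to specify how to compute the various expectations. Since 
$q(\mu) = \mathcal{N}(\mu|\mu_N, \kappa_N^{-1})$, we have 

\begin{align}
\mathbb{E}_{q(\mu)}[\mu] &= \mu_N \tag{21.76} \\
\mathbb{E}_{q(\mu)}[\mu^2] &= \frac{1}{\kappa_N} + \mu_N^2 \tag{21.77}
\end{align}

Since $q(\lambda) = \mathrm{Ga}(\lambda|a_N, b_N)$, we have 

$$\mathbb{E}_{q(\lambda)}[\lambda] = \frac{a_N}{b_N} \tag{21.78}$$

We can now give explicit forms for the update equations. For $q(\mu)$ we have 

\begin{align}
\mu_N &= \frac{\kappa_0 \mu_0 + N \bar{x}}{\kappa_0 + N} \tag{21.79} \\
\kappa_N &= (\kappa_0 + N) \frac{a_N}{b_N} \tag{21.80}
\end{align}

and for $q(\lambda)$ we have 

\begin{align}
a_N &= a_0 + \frac{N + 1}{2} \tag{21.81} \\
b_N &= b_0 + \frac{\kappa_0}{2} \left( \mathbb{E}[\mu^2] + \mu_0^2 - 2 \mathbb{E}[\mu] \mu_0 \right) + \frac{1}{2} \sum_{i=1}^{N} \left( x_i^2 + \mathbb{E}[\mu^2] - 2 \mathbb{E}[\mu] x_i \right) \tag{21.82}
\end{align}

We see that $\mu_N$ and $a_N$ are in fact fixed constants, and only $\kappa_N$ and $b_N$ need to be updated 
iteratively. (In fact, one can solve for the fixed points of $\kappa_N$ and $b_N$ analytically, but we don't 
do this here in order to illustrate the iterative updating scheme.) 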
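
The iterative scheme can be sketched as follows. This is a minimal illustration, not the unigaussVbDemo code: the function name `vb_univariate_gaussian`, the default hyper-parameters and the synthetic data are assumptions, and the $\kappa_0$ term in the update for $b_N$ uses the factor 1/2 implied by Equation 21.75.

```python
import numpy as np


def vb_univariate_gaussian(x, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0, n_iters=20):
    """Iterate the VB updates for q(mu) = N(mu_N, 1/kappa_N), q(lam) = Ga(a_N, b_N)."""
    N, xbar = len(x), float(np.mean(x))
    mu_N = (kappa0 * mu0 + N * xbar) / (kappa0 + N)   # fixed across iterations
    a_N = a0 + (N + 1) / 2                            # fixed across iterations
    b_N = b0                                          # arbitrary initial guess
    for _ in range(n_iters):
        kappa_N = (kappa0 + N) * a_N / b_N            # uses E[lam] = a_N / b_N
        E_mu, E_mu2 = mu_N, 1 / kappa_N + mu_N ** 2   # moments of q(mu)
        b_N = (b0
               + 0.5 * kappa0 * (E_mu2 + mu0 ** 2 - 2 * E_mu * mu0)
               + 0.5 * np.sum(x ** 2 + E_mu2 - 2 * E_mu * x))
    kappa_N = (kappa0 + N) * a_N / b_N
    return mu_N, kappa_N, a_N, b_N


rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=100)          # synthetic data
mu_N, kappa_N, a_N, b_N = vb_univariate_gaussian(x)
print("E[mu] =", mu_N, " E[lambda] =", a_N / b_N)
```

With these updates each new $b_N$ is an affine function of the previous one with slope $1/(2 a_N) < 1$, so the iteration converges geometrically, typically within a handful of sweeps.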


Illustration 


Figure 21.5 gives an example of this method in action. The green contours represent the 
exact posterior, which is Gaussian-Gamma. The dotted red contours represent the variational 
approximation over several iterations. We see that the final approximation is reasonably close to 
the exact solution. However, it is more “compact” than the true distribution. It is often the case 
that mean field inference underestimates the posterior uncertainty; see Section 21.2.2 for more 
discussion of this point. 


Lower bound * 


In VB, we are maximizing L(q), which is a lower bound on the log marginal likelihood: 

$$L(q) \le \log p(\mathcal{D}) = \log \int\!\!\int p(\mathcal{D}|\mu, \lambda)\, p(\mu, \lambda)\, d\mu\, d\lambda \tag{21.83}$$

It is very useful to compute the lower bound itself, for three reasons. First, it can be used to 
assess convergence of the algorithm. Second, it can be used to assess the correctness of one’s 
code: as with EM, if the bound does not increase monotonically, there must be a bug. Third, 
the bound can be used as an approximation to the marginal likelihood, which can be used for 
Bayesian model selection. 

Figure 21.5 Factored variational approximation (red) to the Gaussian-Gamma distribution (green). (a) 
Initial guess. (b) After updating $q_\mu$. (c) After updating $q_\lambda$. (d) At convergence (after 5 iterations). Based on 
Figure 10.4 of (Bishop 2006b). Figure generated by unigaussVbDemo. 

Unfortunately, computing this lower bound involves a fair amount of tedious algebra. We 
work out the details for this example, but for other models, we will just state the results without 
proof, or even omit discussion of the bound altogether, for brevity. 

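For this model, L(q) can be computed as follows: 

\begin{align}
L(q) &= \int\!\!\int q(\mu, \lambda) \log \frac{p(\mathcal{D}, \mu, \lambda)}{q(\mu, \lambda)}\, d\mu\, d\lambda \tag{21.84} \\
&= \mathbb{E}[\log p(\mathcal{D}|\mu, \lambda)] + \mathbb{E}[\log p(\mu|\lambda)] + \mathbb{E}[\log p(\lambda)] \notag \\
&\quad - \mathbb{E}[\log q(\mu)] - \mathbb{E}[\log q(\lambda)] \tag{21.85}
\end{align}

where all expectations are wrt $q(\mu, \lambda)$. We recognize the last two terms as the entropy of a 
Gaussian and the entropy of a Gamma distribution, which are given by 

\begin{align}
\mathbb{H}(\mathcal{N}(\mu_N, \kappa_N^{-1})) &= -\frac{1}{2} \log \kappa_N + \frac{1}{2} (1 + \log(2\pi)) \tag{21.86} \\
\mathbb{H}(\mathrm{Ga}(a_N, b_N)) &= \log \Gamma(a_N) - (a_N - 1) \psi(a_N) - \log(b_N) + a_N \tag{21.87}
\end{align}

where $\psi()$ is the digamma function. 
To compute the other terms, we need the following facts: 

\begin{align}
\mathbb{E}[\log x \,|\, x \sim \mathrm{Ga}(a, b)] &= \psi(a) - \log(b) \tag{21.88} \\
\mathbb{E}[x \,|\, x \sim \mathrm{Ga}(a, b)] &= \frac{a}{b} \tag{21.89} \\
\mathbb{E}[x \,|\, x \sim \mathcal{N}(\mu, \sigma^2)] &= \mu \tag{21.90} \\
\mathbb{E}[x^2 \,|\, x \sim \mathcal{N}(\mu, \sigma^2)] &= \mu^2 + \sigma^2 \tag{21.91}
\end{align}

For the expected log likelihood, one can show that 

\begin{align}
&\mathbb{E}_{q(\mu, \lambda)}[\log p(\mathcal{D}|\mu, \lambda)] \tag{21.92} \\
&= -\frac{N}{2} \log(2\pi) + \frac{N}{2} \mathbb{E}_{q(\lambda)}[\log \lambda] - \frac{\mathbb{E}_{q(\lambda)}[\lambda]}{2} \sum_{i=1}^{N} \mathbb{E}_{q(\mu)}[(x_i - \mu)^2] \notag \\
&= -\frac{N}{2} \log(2\pi) + \frac{N}{2} (\psi(a_N) - \log b_N) \tag{21.93} \\
&\quad - \frac{N a_N}{2 b_N} \Big( \hat{\sigma}^2 + \bar{x}^2 - 2 \mu_N \bar{x} + \mu_N^2 + \frac{1}{\kappa_N} \Big) \tag{21.94}
\end{align}

where $\bar{x}$ and $\hat{\sigma}^2$ are the empirical mean and variance. 
For the expected log prior of $\lambda$, we have 

\begin{align}
\mathbb{E}_{q(\lambda)}[\log p(\lambda)] &= (a_0 - 1) \mathbb{E}[\log \lambda] - b_0 \mathbb{E}[\lambda] + a_0 \log b_0 - \log \Gamma(a_0) \tag{21.95} \\
&= (a_0 - 1)(\psi(a_N) - \log b_N) - b_0 \frac{a_N}{b_N} + a_0 \log b_0 - \log \Gamma(a_0) \tag{21.96}
\end{align}

For the expected log prior of $\mu$, one can show that 

\begin{align}
\mathbb{E}_{q(\mu, \lambda)}[\log p(\mu|\lambda)] &= \frac{1}{2} \log \frac{\kappa_0}{2\pi} + \frac{1}{2} \mathbb{E}_{q(\lambda)}[\log \lambda] - \frac{1}{2} \mathbb{E}_{q(\mu, \lambda)}[(\mu - \mu_0)^2 \kappa_0 \lambda] \notag \\
&= \frac{1}{2} \log \frac{\kappa_0}{2\pi} + \frac{1}{2} (\psi(a_N) - \log b_N) - \frac{\kappa_0}{2} \frac{a_N}{b_N} \Big[ \frac{1}{\kappa_N} + (\mu_N - \mu_0)^2 \Big] \tag{21.97}
\end{align}

Putting it all together, one can show that 

$$L(q) = \frac{1}{2} \log \frac{1}{\kappa_N} + \log \Gamma(a_N) - a_N \log b_N + \text{const} \tag{21.98}$$

This quantity monotonically increases after each VB update. 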


Example: VB for linear regression 


In Section 7.6.4, we discussed an empirical Bayes approach to setting the hyper-parameters for
ridge regression known as the evidence procedure. In particular, we assumed a likelihood of
the form $p(\mathbf{y}|\mathbf{X},\mathbf{w},\lambda) = \mathcal{N}(\mathbf{X}\mathbf{w},\lambda^{-1})$ and a prior of the form
$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{0},\alpha^{-1}\mathbf{I})$. We then
computed a type II estimate of $\alpha$ and $\lambda$. The same approach was extended in Section 13.7 to
handle a prior of the form $\mathcal{N}(\mathbf{w}|\mathbf{0},\mathrm{diag}(\boldsymbol{\alpha})^{-1})$, which allows one hyper-parameter per feature,
a technique known as automatic relevancy determination.

In this section, we derive a VB algorithm for this model. We follow the presentation of 
(Drugowitsch 2008). Initially we will use the following prior: 


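$$p(\mathbf{w},\lambda,\alpha) = \mathcal{N}(\mathbf{w}|\mathbf{0},(\lambda\alpha)^{-1}\mathbf{I})\,\mathrm{Ga}(\lambda|a_0^{\lambda},b_0^{\lambda})\,\mathrm{Ga}(\alpha|a_0^{\alpha},b_0^{\alpha}) \tag{21.99}$$
We choose to use the following factorized approximation to the posterior:
$$q(\mathbf{w},\alpha,\lambda) = q(\mathbf{w},\lambda)\,q(\alpha) \tag{21.100}$$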


Given these assumptions, one can show (see (Drugowitsch 2008)) that the optimal form for the 
posterior is 


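$$q(\mathbf{w},\alpha,\lambda) = \mathcal{N}(\mathbf{w}|\mathbf{w}_N,\lambda^{-1}\mathbf{V}_N)\,\mathrm{Ga}(\lambda|a_N^{\lambda},b_N^{\lambda})\,\mathrm{Ga}(\alpha|a_N^{\alpha},b_N^{\alpha}) \tag{21.101}$$
where
$$\mathbf{V}_N^{-1} = \bar{\mathbf{A}} + \mathbf{X}^T\mathbf{X} \tag{21.102}$$
$$\mathbf{w}_N = \mathbf{V}_N\mathbf{X}^T\mathbf{y} \tag{21.103}$$
$$a_N^{\lambda} = a_0^{\lambda} + \frac{N}{2} \tag{21.104}$$
$$b_N^{\lambda} = b_0^{\lambda} + \frac{1}{2}\left(\|\mathbf{y}-\mathbf{X}\mathbf{w}_N\|^2 + \mathbf{w}_N^T\bar{\mathbf{A}}\mathbf{w}_N\right) \tag{21.105}$$
$$a_N^{\alpha} = a_0^{\alpha} + \frac{D}{2} \tag{21.106}$$
$$b_N^{\alpha} = b_0^{\alpha} + \frac{1}{2}\left(\frac{a_N^{\lambda}}{b_N^{\lambda}}\mathbf{w}_N^T\mathbf{w}_N + \mathrm{tr}(\mathbf{V}_N)\right) \tag{21.107}$$
$$\bar{\mathbf{A}} = \langle\alpha\rangle\mathbf{I} = \frac{a_N^{\alpha}}{b_N^{\alpha}}\mathbf{I} \tag{21.108}$$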
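The coordinate updates in Equations 21.102 to 21.108 can be sketched in a few lines for the special case of a single feature (D = 1), where V_N, w_N and the mean of alpha are all scalars. This is only an illustrative sketch: the function name `vb_linreg_1d`, the default hyper-parameter values, the fixed iteration count, and the toy data are assumptions, not part of the text.

```python
def vb_linreg_1d(x, y, a0_lam=1e-2, b0_lam=1e-2, a0_alpha=1e-2, b0_alpha=1e-2, n_iter=50):
    """VB updates (21.102)-(21.108) specialised to one feature (D = 1).

    All names and default values are illustrative assumptions.
    """
    n, d = len(x), 1
    sxx = sum(xi * xi for xi in x)                    # X^T X (a scalar when D = 1)
    sxy = sum(xi * yi for xi, yi in zip(x, y))        # X^T y
    a_lam = a0_lam + n / 2.0                          # (21.104), does not change
    a_alpha = a0_alpha + d / 2.0                      # (21.106), does not change
    b_lam, b_alpha = b0_lam, b0_alpha                 # starting values
    for _ in range(n_iter):
        a_bar = a_alpha / b_alpha                     # (21.108): posterior mean of alpha
        v = 1.0 / (a_bar + sxx)                       # (21.102)
        w = v * sxy                                   # (21.103)
        resid = sum((yi - xi * w) ** 2 for xi, yi in zip(x, y))
        b_lam = b0_lam + 0.5 * (resid + a_bar * w * w)              # (21.105)
        b_alpha = b0_alpha + 0.5 * ((a_lam / b_lam) * w * w + v)    # (21.107)
    return {"w": w, "v": v, "e_lambda": a_lam / b_lam, "e_alpha": a_alpha / b_alpha}


# Toy data (hypothetical): slope 2 plus a small alternating perturbation.
x = [i / 2.0 for i in range(-10, 11)]
y = [2.0 * xi + 0.1 * (-1) ** i for i, xi in enumerate(x)]
result = vb_linreg_1d(x, y)
```

With many data points and a weak prior, the posterior mean `result["w"]` should sit close to the generating slope, and `result["e_lambda"]` gives the posterior mean of the noise precision.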


This method can be extended to the ARD case in a straightforward way, by using the following 
priors: 


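$$p(\mathbf{w}) = \mathcal{N}(\mathbf{0},\mathrm{diag}(\boldsymbol{\alpha})^{-1}) \tag{21.109}$$
$$p(\boldsymbol{\alpha}) = \prod_{j=1}^{D}\mathrm{Ga}(\alpha_j|a_0^{\alpha},b_0^{\alpha}) \tag{21.110}$$


The posterior for $\mathbf{w}$ and $\lambda$ is computed as before, except we use
$\bar{\mathbf{A}} = \mathrm{diag}(a_N^{\alpha}/b_{N_j}^{\alpha})$ instead of
$(a_N^{\alpha}/b_N^{\alpha})\mathbf{I}$. The posterior for $\boldsymbol{\alpha}$ has the form


$$q(\boldsymbol{\alpha}) = \prod_j \mathrm{Ga}(\alpha_j|a_N^{\alpha},b_{N_j}^{\alpha}) \tag{21.111}$$
$$a_N^{\alpha} = a_0^{\alpha} + \frac{1}{2} \tag{21.112}$$
$$b_{N_j}^{\alpha} = b_0^{\alpha} + \frac{1}{2}\left(\frac{a_N^{\lambda}}{b_N^{\lambda}}w_{N,j}^2 + (\mathbf{V}_N)_{jj}\right) \tag{21.113}$$


3. Note that Drugowitsch uses $a_0, b_0$ as the hyper-parameters for $p(\lambda)$ and $c_0, d_0$ as the hyper-parameters for $p(\alpha)$,
whereas (Bishop 2006b, Sec 10.3) uses $a_0, b_0$ as the hyper-parameters for $p(\alpha)$ and treats $\lambda$ as fixed. To (hopefully)
avoid confusion, I use $a_0^{\lambda}, b_0^{\lambda}$ as the hyper-parameters for $p(\lambda)$, and $a_0^{\alpha}, b_0^{\alpha}$ as the hyper-parameters for $p(\alpha)$.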


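The algorithm alternates between updating $q(\mathbf{w},\lambda)$ and $q(\boldsymbol{\alpha})$. Once $\mathbf{w}$ and $\lambda$ have been
inferred, the posterior predictive is a Student distribution, as shown in Equation 7.76. Specifically,
for a single data case, we have


$$p(y|\mathbf{x},\mathcal{D}) = \mathcal{T}\left(y \,\Big|\, \mathbf{w}_N^T\mathbf{x},\ \frac{b_N^{\lambda}}{a_N^{\lambda}}(1 + \mathbf{x}^T\mathbf{V}_N\mathbf{x}),\ 2a_N^{\lambda}\right) \tag{21.114}$$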


The exact marginal likelihood, which can be used for model selection, is given by 


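$$p(\mathbf{y}|\mathbf{X}) = \int\!\!\int\!\!\int p(\mathbf{y}|\mathbf{X},\mathbf{w},\lambda)\,p(\mathbf{w}|\alpha)\,p(\lambda)\,p(\alpha)\,d\mathbf{w}\,d\alpha\,d\lambda \tag{21.115}$$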


We can compute a lower bound on log p(D) as follows: 


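$$L(q) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{N}\left(\frac{a_N^{\lambda}}{b_N^{\lambda}}(y_i - \mathbf{w}_N^T\mathbf{x}_i)^2 + \mathbf{x}_i^T\mathbf{V}_N\mathbf{x}_i\right)$$
$$\quad + \frac{1}{2}\log|\mathbf{V}_N| + \frac{D}{2}$$
$$\quad - \log\Gamma(a_0^{\lambda}) + a_0^{\lambda}\log b_0^{\lambda} - b_0^{\lambda}\frac{a_N^{\lambda}}{b_N^{\lambda}} + \log\Gamma(a_N^{\lambda}) - a_N^{\lambda}\log b_N^{\lambda} + a_N^{\lambda}$$
$$\quad - \log\Gamma(a_0^{\alpha}) + a_0^{\alpha}\log b_0^{\alpha} + \log\Gamma(a_N^{\alpha}) - a_N^{\alpha}\log b_N^{\alpha} \tag{21.116}$$
In the ARD case, the last line becomes
$$\sum_{j=1}^{D}\left[-\log\Gamma(a_0^{\alpha}) + a_0^{\alpha}\log b_0^{\alpha} + \log\Gamma(a_N^{\alpha}) - a_N^{\alpha}\log b_{N_j}^{\alpha}\right] \tag{21.117}$$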


Figure 21.6 compares VB and EB on a model selection problem for polynomial regression. We
see that VB gives similar results to EB, but the precise behavior depends on the sample size.
When N = 5, VB’s estimate of the posterior over models is more diffuse than EB’s, since VB
models uncertainty in the hyper-parameters. When N = 30, the posterior estimate of the
hyper-parameters becomes more well-determined. Indeed, if we compute $E[\alpha|\mathcal{D}]$ when we have an
uninformative prior, $a_0^{\alpha} = b_0^{\alpha} = 0$, we get


$$\bar{\alpha} = \frac{a_N^{\alpha}}{b_N^{\alpha}} = \frac{D/2}{\frac{1}{2}\left(\frac{a_N^{\lambda}}{b_N^{\lambda}}\mathbf{w}_N^T\mathbf{w}_N + \mathrm{tr}(\mathbf{V}_N)\right)} \tag{21.118}$$




Figure 21.6 We plot the posterior over models (polynomials of degree 1, 2 and 3) assuming a uniform
prior $p(m) \propto 1$. We approximate the marginal likelihood using (a,c) VB and (b,d) EB. In (a-b), we use
N = 5 data points (shown in Figure 5.7). In (c-d), we use N = 30 data points (shown in Figure 5.8). Figure
generated by linregEbModelSel1VsN.


Compare this to Equation 13.167 for EB: 


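$$\hat{\alpha} = \frac{D}{E[\mathbf{w}^T\mathbf{w}]} = \frac{D}{\mathbf{w}_N^T\mathbf{w}_N + \mathrm{tr}(\mathbf{V}_N)} \tag{21.119}$$


Modulo the $a_N^{\lambda}$ and $b_N^{\lambda}$ terms, these are the same. In hindsight this is perhaps not that
surprising, since EB is trying to maximize $\log p(\mathcal{D})$, and VB is trying to maximize a lower
bound on $\log p(\mathcal{D})$.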


Variational Bayes EM 


Now consider latent variable models of the form $\mathbf{z}_i \rightarrow \mathbf{x}_i \leftarrow \boldsymbol{\theta}$. This includes mixture models,
PCA, HMMs, etc. There are now two kinds of unknowns: parameters, $\boldsymbol{\theta}$, and latent variables, $\mathbf{z}_i$.
As we saw in Section 11.4, it is common to fit such models using EM, where in the E step we
infer the posterior over the latent variables, $p(\mathbf{z}_i|\mathbf{x}_i,\boldsymbol{\theta})$, and in the M step, we compute a point
estimate of the parameters, $\boldsymbol{\theta}$. The justification for this is two-fold. First, it results in simple
algorithms. Second, the posterior uncertainty in $\boldsymbol{\theta}$ is usually less than in $\mathbf{z}_i$, since the $\boldsymbol{\theta}$ are
informed by all N data cases, whereas $\mathbf{z}_i$ is only informed by $\mathbf{x}_i$; this makes a MAP estimate of
$\boldsymbol{\theta}$ more reasonable than a MAP estimate of $\mathbf{z}_i$.

However, VB provides a way to be “more Bayesian”, by modeling uncertainty in the parameters
$\boldsymbol{\theta}$ as well as in the latent variables $\mathbf{z}_i$, at a computational cost that is essentially the same as EM.
This method is known as variational Bayes EM or VBEM. The basic idea is to use mean field,
where the approximate posterior has the form


$$p(\boldsymbol{\theta},\mathbf{z}_{1:N}|\mathcal{D}) \approx q(\boldsymbol{\theta})\,q(\mathbf{z}) = q(\boldsymbol{\theta})\prod_i q(\mathbf{z}_i) \tag{21.120}$$


The first factorization, between $\boldsymbol{\theta}$ and $\mathbf{z}$, is a crucial assumption to make the algorithm tractable.
The second factorization follows from the model, since the latent variables are iid conditional
on $\boldsymbol{\theta}$.

In VBEM, we alternate between updating $q(\mathbf{z}_i|\mathcal{D})$ (the variational E step) and updating $q(\boldsymbol{\theta}|\mathcal{D})$
(the variational M step). We can recover standard EM from VBEM by approximating the parameter
posterior using a delta function, $q(\boldsymbol{\theta}|\mathcal{D}) \approx \delta_{\hat{\boldsymbol{\theta}}}(\boldsymbol{\theta})$.

The variational E step is similar to a standard E step, except instead of plugging in a MAP
estimate of the parameters and computing $p(\mathbf{z}_i|\mathcal{D},\hat{\boldsymbol{\theta}})$, we need to average over the parameters.
Roughly speaking, this can be computed by plugging in the posterior mean of the parameters
instead of the MAP estimate, and then computing $p(\mathbf{z}_i|\mathcal{D},\bar{\boldsymbol{\theta}})$ using standard algorithms, such
as forwards-backwards. Unfortunately, things are not quite this simple, but this is the basic idea.
The details depend on the form of the model; we give some examples below. 

The variational M step is similar to a standard M step, except instead of computing a point 
estimate of the parameters, we update the hyper-parameters, using the expected sufficient statis- 
tics. This process is usually very similar to MAP estimation in regular EM. Again, the details on 
how to do this depend on the form of the model. 

The principal advantage of VBEM over regular EM is that by marginalizing out the parameters,
we can compute a lower bound on the marginal likelihood, which can be used for model 
selection. We will see an example of this in Section 21.6.1.6. VBEM is also “egalitarian”, since 
it treats parameters as “first class citizens”, just like any other unknown quantity, whereas EM 
makes an artificial distinction between parameters and latent variables. 


Example: VBEM for mixtures of Gaussians * 


Let us consider how to “fit” a mixture of Gaussians using VBEM. (We use scare quotes since we 
are not estimating the model parameters, but inferring a posterior over them.) We will follow 
the presentation of (Bishop 2006b, Sec 10.2). Unfortunately, the details are rather complicated. 
Fortunately, as with EM, one gets used to it after a bit of practice. (As usual with math, simply 
reading the equations won't help much, you should really try deriving these results yourself (or 
try some of the exercises) if you want to learn this stuff in depth.) 


The variational posterior 


The likelihood function is the usual one for Gaussian mixture models: 


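$$p(\mathbf{z},\mathbf{X}|\boldsymbol{\theta}) = \prod_i\prod_k \pi_k^{z_{ik}}\,\mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})^{z_{ik}} \tag{21.121}$$


where $z_{ik} = 1$ if data point i belongs to cluster k, and $z_{ik} = 0$ otherwise.
We will assume the following factored conjugate prior


$$p(\boldsymbol{\theta}) = \mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\alpha}_0)\prod_k \mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_0,(\beta_0\boldsymbol{\Lambda}_k)^{-1})\,\mathrm{Wi}(\boldsymbol{\Lambda}_k|\mathbf{L}_0,\nu_0) \tag{21.122}$$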


where $\boldsymbol{\Lambda}_k$ is the precision matrix for cluster k. The subscript 0 means these are parameters
of the prior; we assume all the prior parameters are the same for all clusters. For the mixing
weights, we usually use a symmetric prior, $\boldsymbol{\alpha}_0 = \alpha_0\mathbf{1}$.

The exact posterior $p(\mathbf{z},\boldsymbol{\theta}|\mathcal{D})$ is a mixture of $K^N$ distributions, corresponding to all possible
labelings $\mathbf{z}$. We will try to approximate the volume around one of these modes. We will use the
standard VB approximation to the posterior:


$$p(\boldsymbol{\theta},\mathbf{z}_{1:N}|\mathcal{D}) \approx q(\boldsymbol{\theta})\prod_i q(\mathbf{z}_i) \tag{21.123}$$


At this stage we have not specified the forms of the q functions; these will be determined by 
the form of the likelihood and prior. Below we will show that the optimal form is as follows: 


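$$q(\mathbf{z},\boldsymbol{\theta}) = q(\mathbf{z})\,q(\boldsymbol{\theta}) = \left[\prod_i \mathrm{Cat}(\mathbf{z}_i|\mathbf{r}_i)\right] \tag{21.124}$$
$$\quad\times\left[\mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\alpha})\prod_k \mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_k,(\beta_k\boldsymbol{\Lambda}_k)^{-1})\,\mathrm{Wi}(\boldsymbol{\Lambda}_k|\mathbf{L}_k,\nu_k)\right] \tag{21.125}$$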


(The lack of 0 subscript means these are parameters of the posterior, not the prior.) Below we 
will derive the update equations for these variational parameters. 


Derivation of q(z) (variational E step) 


The form for q(z) can be obtained by looking at the complete data log joint, ignoring terms that 
do not involve z, and taking expectations of what’s left over wrt all the hidden variables except 
for z. We have 


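$$\log q(\mathbf{z}) = E_{q(\boldsymbol{\theta})}[\log p(\mathbf{x},\mathbf{z},\boldsymbol{\theta})] + \mathrm{const} \tag{21.126}$$
$$= \sum_i\sum_k z_{ik}\log\rho_{ik} + \mathrm{const} \tag{21.127}$$


where we define


$$\log\rho_{ik} \triangleq E_{q(\boldsymbol{\theta})}[\log\pi_k] + \frac{1}{2}E_{q(\boldsymbol{\theta})}[\log|\boldsymbol{\Lambda}_k|] - \frac{D}{2}\log(2\pi) - \frac{1}{2}E_{q(\boldsymbol{\theta})}\left[(\mathbf{x}_i-\boldsymbol{\mu}_k)^T\boldsymbol{\Lambda}_k(\mathbf{x}_i-\boldsymbol{\mu}_k)\right] \tag{21.128}$$

Using the fact that $q(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\alpha})$, we have


$$\log\tilde{\pi}_k \triangleq E[\log\pi_k] = \psi(\alpha_k) - \psi\left(\sum_{k'}\alpha_{k'}\right) \tag{21.129}$$

where $\psi()$ is the digamma function. (See Exercise 21.5 for the detailed derivation.) Next, we use
the fact that


$$q(\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k) = \mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_k,(\beta_k\boldsymbol{\Lambda}_k)^{-1})\,\mathrm{Wi}(\boldsymbol{\Lambda}_k|\mathbf{L}_k,\nu_k) \tag{21.130}$$
to get
$$\log\tilde{\Lambda}_k \triangleq E[\log|\boldsymbol{\Lambda}_k|] = \sum_{j=1}^{D}\psi\left(\frac{\nu_k+1-j}{2}\right) + D\log 2 + \log|\mathbf{L}_k| \tag{21.131}$$
Finally, for the expected value of the quadratic form, we get
$$E\left[(\mathbf{x}_i-\boldsymbol{\mu}_k)^T\boldsymbol{\Lambda}_k(\mathbf{x}_i-\boldsymbol{\mu}_k)\right] = D\beta_k^{-1} + \nu_k(\mathbf{x}_i-\mathbf{m}_k)^T\mathbf{L}_k(\mathbf{x}_i-\mathbf{m}_k) \tag{21.132}$$
Putting it all together, we get that the posterior responsibility of cluster k for datapoint i is
$$r_{ik} \propto \tilde{\pi}_k\,\tilde{\Lambda}_k^{1/2}\exp\left(-\frac{D}{2\beta_k} - \frac{\nu_k}{2}(\mathbf{x}_i-\mathbf{m}_k)^T\mathbf{L}_k(\mathbf{x}_i-\mathbf{m}_k)\right) \tag{21.133}$$


Compare this to the expression used in regular EM:
$$r_{ik}^{EM} \propto \hat{\pi}_k|\hat{\boldsymbol{\Lambda}}_k|^{1/2}\exp\left(-\frac{1}{2}(\mathbf{x}_i-\hat{\boldsymbol{\mu}}_k)^T\hat{\boldsymbol{\Lambda}}_k(\mathbf{x}_i-\hat{\boldsymbol{\mu}}_k)\right) \tag{21.134}$$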
The significance of this difference is discussed further in Section 21.6.1.7. 
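As a rough illustration of the variational E step, the sketch below computes the responsibilities for one-dimensional data (D = 1), so that each Wishart scale L_k is a positive scalar and the sum over j in the expected log determinant has a single term. The helper `digamma`, the function name `vb_e_step_1d`, and the example parameter values are assumptions made for this sketch, not part of the text.

```python
import math


def digamma(x):
    """Digamma function for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def vb_e_step_1d(xs, alpha, beta, m, L, nu):
    """Responsibilities r_ik for a 1-D Gaussian mixture under the VB posterior."""
    # E[log pi_k] under the Dirichlet posterior.
    e_log_pi = [digamma(a) - digamma(sum(alpha)) for a in alpha]
    # E[log Lambda_k] for D = 1: psi(nu_k / 2) + log 2 + log L_k.
    e_log_lam = [digamma(n / 2.0) + math.log(2.0) + math.log(l) for n, l in zip(nu, L)]
    resp = []
    for x in xs:
        log_rho = [
            e_log_pi[k] + 0.5 * e_log_lam[k] - 0.5 * math.log(2.0 * math.pi)
            - 0.5 * (1.0 / beta[k] + nu[k] * L[k] * (x - m[k]) ** 2)
            for k in range(len(alpha))
        ]
        mx = max(log_rho)                      # subtract the max for numerical stability
        w = [math.exp(v - mx) for v in log_rho]
        s = sum(w)
        resp.append([v / s for v in w])
    return resp


# Hypothetical posterior parameters for two well-separated clusters.
resp = vb_e_step_1d([-2.0, 2.1], alpha=[5.0, 5.0], beta=[5.0, 5.0],
                    m=[-2.0, 2.0], L=[0.2, 0.2], nu=[5.0, 5.0])
```

Each row of `resp` sums to one; a point sitting on a cluster mean should be assigned almost entirely to that cluster.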


Derivation of q(0) (variational M step) 


Using the mean field recipe, we have 


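$$\log q(\boldsymbol{\theta}) = \log p(\boldsymbol{\pi}) + \sum_k \log p(\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k) + \sum_i E_{q(\mathbf{z})}[\log p(\mathbf{z}_i|\boldsymbol{\pi})]$$
$$\quad + \sum_k\sum_i E_{q(\mathbf{z})}[z_{ik}]\log\mathcal{N}(\mathbf{x}_i|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1}) + \mathrm{const} \tag{21.135}$$


We see this factorizes into the form


$$q(\boldsymbol{\theta}) = q(\boldsymbol{\pi})\prod_k q(\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k) \tag{21.136}$$


For the $\boldsymbol{\pi}$ term, we have


$$\log q(\boldsymbol{\pi}) = (\alpha_0-1)\sum_k\log\pi_k + \sum_k\sum_i r_{ik}\log\pi_k + \mathrm{const} \tag{21.137}$$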


Exponentiating, we recognize this as a Dirichlet distribution:
$$q(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\alpha}) \tag{21.138}$$
$$\alpha_k = \alpha_0 + N_k \tag{21.139}$$
$$N_k = \sum_i r_{ik} \tag{21.140}$$


[Figure 21.7 plot. Title: variational Bayes objective for GMM on old faithful data. x-axis: iter. y-axis: lower bound on log marginal likelihood.]


Figure 21.7 Lower bound vs iterations for the VB algorithm in Figure 21.8. The steep parts of the 
curve correspond to places where the algorithm figures out that it can increase the bound by “killing 
off” unnecessary mixture components, as described in Section 21.6.1.6. The plateaus correspond to slowly 
moving the clusters around. Figure generated by mixGaussVbDemoFaithful. 


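For the $\boldsymbol{\mu}_k$ and $\boldsymbol{\Lambda}_k$ terms, we have


$$q(\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k) = \mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_k,(\beta_k\boldsymbol{\Lambda}_k)^{-1})\,\mathrm{Wi}(\boldsymbol{\Lambda}_k|\mathbf{L}_k,\nu_k) \tag{21.141}$$
$$\beta_k = \beta_0 + N_k \tag{21.142}$$
$$\mathbf{m}_k = (\beta_0\mathbf{m}_0 + N_k\bar{\mathbf{x}}_k)/\beta_k \tag{21.143}$$
$$\mathbf{L}_k^{-1} = \mathbf{L}_0^{-1} + N_k\mathbf{S}_k + \frac{\beta_0 N_k}{\beta_0+N_k}(\bar{\mathbf{x}}_k-\mathbf{m}_0)(\bar{\mathbf{x}}_k-\mathbf{m}_0)^T \tag{21.144}$$
$$\nu_k = \nu_0 + N_k \tag{21.145}$$
$$\bar{\mathbf{x}}_k = \frac{1}{N_k}\sum_i r_{ik}\mathbf{x}_i \tag{21.146}$$
$$\mathbf{S}_k = \frac{1}{N_k}\sum_i r_{ik}(\mathbf{x}_i-\bar{\mathbf{x}}_k)(\mathbf{x}_i-\bar{\mathbf{x}}_k)^T \tag{21.147}$$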


This is very similar to the M step for MAP estimation discussed in Section 11.4.2.8, except here
we are computing the parameters of the posterior over $\boldsymbol{\theta}$, rather than MAP estimates of $\boldsymbol{\theta}$.
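The variational M step can be sketched for one-dimensional data (D = 1), where L_k is a scalar, as follows. The sketch assumes the degrees-of-freedom update nu_k = nu_0 + N_k (the form used in Bishop 2006b, Sec 10.2); the function name `vb_m_step_1d`, the guard for empty clusters, and the toy inputs are illustrative assumptions.

```python
def vb_m_step_1d(xs, resp, alpha0, beta0, m0, L0, nu0):
    """Update the posterior hyper-parameters of a 1-D Gaussian mixture.

    resp[i][k] is the responsibility of cluster k for data point i.
    Returns one dict of hyper-parameters per cluster.
    """
    n_clusters = len(resp[0])
    out = []
    for k in range(n_clusters):
        nk = sum(r[k] for r in resp)                                   # effective count N_k
        if nk > 1e-12:
            xbar = sum(r[k] * x for r, x in zip(resp, xs)) / nk        # weighted mean
            sk = sum(r[k] * (x - xbar) ** 2 for r, x in zip(resp, xs)) / nk  # weighted variance
        else:
            xbar, sk = m0, 0.0       # empty cluster: fall back to the prior (assumed convention)
        beta_k = beta0 + nk
        inv_l = 1.0 / L0 + nk * sk + (beta0 * nk / (beta0 + nk)) * (xbar - m0) ** 2
        out.append({
            "alpha": alpha0 + nk,
            "beta": beta_k,
            "m": (beta0 * m0 + nk * xbar) / beta_k,
            "L": 1.0 / inv_l,
            "nu": nu0 + nk,          # assumed form of the degrees-of-freedom update
        })
    return out


# Hard assignments are used here only to make the arithmetic easy to follow.
params = vb_m_step_1d([0.0, 2.0, 10.0], [[1, 0], [1, 0], [0, 1]],
                      alpha0=1.0, beta0=1.0, m0=0.0, L0=1.0, nu0=2.0)
```

Note that the output is a set of posterior hyper-parameters per cluster, not point estimates: the mean `m` is shrunk toward `m0`, and clusters with few points keep hyper-parameters close to the prior.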
Lower bound on the marginal likelihood 


The algorithm is trying to maximize the following lower bound
$$L = \sum_{\mathbf{z}}\int q(\mathbf{z},\boldsymbol{\theta})\log\frac{p(\mathbf{x},\mathbf{z},\boldsymbol{\theta})}{q(\mathbf{z},\boldsymbol{\theta})}\,d\boldsymbol{\theta} \le \log p(\mathcal{D}) \tag{21.148}$$


This quantity should increase monotonically with each iteration, as shown in Figure 21.7.
Unfortunately, deriving the bound is a bit messy, because we need to compute expectations of the
unnormalized log posterior as well as entropies of the q distribution. We leave the details (which
are similar to Section 21.5.1.6) to Exercise 21.4.


Posterior predictive distribution


We showed that the approximate posterior has the form 


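$$q(\boldsymbol{\theta}) = \mathrm{Dir}(\boldsymbol{\pi}|\boldsymbol{\alpha})\prod_k \mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_k,(\beta_k\boldsymbol{\Lambda}_k)^{-1})\,\mathrm{Wi}(\boldsymbol{\Lambda}_k|\mathbf{L}_k,\nu_k) \tag{21.149}$$


Consequently the posterior predictive density can be approximated as follows, using the results
from Section 4.6.3.6:


$$p(\mathbf{x}|\mathcal{D}) \approx \sum_{z}\int p(\mathbf{x}|z,\boldsymbol{\theta})\,p(z|\boldsymbol{\theta})\,q(\boldsymbol{\theta})\,d\boldsymbol{\theta} \tag{21.150}$$
$$= \sum_k\int \pi_k\,\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k,\boldsymbol{\Lambda}_k^{-1})\,q(\boldsymbol{\theta})\,d\boldsymbol{\theta} \tag{21.151}$$
$$= \sum_k\frac{\alpha_k}{\sum_{k'}\alpha_{k'}}\,\mathcal{T}(\mathbf{x}|\mathbf{m}_k,\mathbf{M}_k,\nu_k+1-D) \tag{21.152}$$
$$\mathbf{M}_k = \frac{(\nu_k+1-D)\beta_k}{1+\beta_k}\mathbf{L}_k \tag{21.153}$$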


This is just a weighted sum of Student distributions. If instead we used a plug-in approximation, 
we would get a weighted sum of Gaussian distributions. 


Model selection using VBEM 


The simplest way to select K when using VB is to fit several models, and then to use the
variational lower bound to the log marginal likelihood, $L(K) \le \log p(\mathcal{D}|K)$, to approximate
$p(K|\mathcal{D})$:


$$p(K|\mathcal{D}) = \frac{e^{L(K)}}{\sum_{K'}e^{L(K')}} \tag{21.154}$$


However, the lower bound needs to be modified somewhat to take into account the lack of
identifiability of the parameters (Section 11.3.1). In particular, although VB will approximate the
volume occupied by the parameter posterior, it will only do so around one of the local modes.
With K components, there are K! equivalent modes, which differ merely by permuting the
labels. Therefore we should use $\log p(\mathcal{D}|K) \approx L(K) + \log(K!)$.
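The model-selection recipe above, including the log(K!) correction for label permutations, amounts to a softmax over corrected bounds. The sketch below assumes the lower bounds have already been computed by separate VBEM runs; the function name `posterior_over_k` and the bound values are made up purely for illustration.

```python
import math


def posterior_over_k(lower_bounds):
    """Approximate p(K|D) from variational lower bounds.

    lower_bounds maps each candidate K to its bound L(K). A uniform prior
    over K is assumed, and log(K!) is added to account for the K! equivalent
    relabelings of the mixture components.
    """
    scores = {k: lb + math.lgamma(k + 1) for k, lb in lower_bounds.items()}  # lgamma(K+1) = log K!
    mx = max(scores.values())                       # subtract the max before exponentiating
    z = sum(math.exp(s - mx) for s in scores.values())
    return {k: math.exp(s - mx) / z for k, s in scores.items()}


# Hypothetical bound values, not taken from any experiment.
post = posterior_over_k({1: -120.0, 2: -100.0, 3: -101.0})
```

In this made-up example the raw bound prefers K = 2, but after adding log(K!) the corrected score for K = 3 is slightly higher, which shows that the permutation correction can change the ranking when bounds are close.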


Automatic sparsity inducing effects of VBEM 


Although VB provides a reasonable approximation to the marginal likelihood (better than BIC 
(Beal and Ghahramani 2006)), this method still requires fitting multiple models, one for each 
value of K being considered. A faster alternative is to fit a single model, where K is set large,
but where $\alpha_0$ is set very small, $\alpha_0 \ll 1$. From Figure 2.14(d), we see that the resulting prior for
the mixing weights $\boldsymbol{\pi}$ has “spikes” near the corners of the simplex, encouraging a sparse mixing
weight vector.

In regular EM, the MAP estimate of the mixing weights will have the form $\hat{\pi}_k \propto (\alpha_k - 1)$,
where $\alpha_k = \alpha_0 + N_k$. Unfortunately, this can be negative if $\alpha_0 = 0$ and $N_k = 0$ (Figueiredo


Figure 21.8 We visualize the posterior mean parameters at various stages of the VBEM algorithm applied 
to a mixture of Gaussians model on the Old Faithful data. Shading intensity is proportional to the mixing 
weight. We initialize with K-means and use $\alpha_0 = 0.001$ as the Dirichlet hyper-parameter. Based on Figure
10.6 of (Bishop 2006b). Figure generated by mixGaussVbDemoFaithful, based on code by Emtiyaz Khan. 




Figure 21.9 We visualize the posterior values of $\alpha_k$ for the model in Figure 21.8. We see that unnecessary
components get “killed off”. Figure generated by mixGaussVbDemoFaithful. 


and Jain 2002). However, in VBEM, we use


$$\tilde{\pi}_k = \frac{\exp[\psi(\alpha_k)]}{\exp\left[\psi\left(\sum_{k'}\alpha_{k'}\right)\right]} \tag{21.155}$$

Now $\exp(\psi(x)) \approx x - 0.5$ for $x > 1$. So if $\alpha_0 = 0$, when we compute $\tilde{\pi}_k$, it’s like we subtract
0.5 from the posterior counts. This will hurt small clusters more than large clusters (like a
regressive tax).⁴ The effect is that clusters which have very few (weighted) members become
more and more empty over successive iterations, whereas the popular clusters get more and 
more members. This is called the rich get richer phenomenon; we will encounter it again in 
Section 25.2, when we discuss Dirichlet process mixture models. 
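The "subtract 0.5" effect can be checked numerically. The sketch below evaluates the VBEM quantity exp(psi(alpha_k)) / exp(psi(sum of alpha)) for a few hypothetical cluster counts and compares it with the plain proportion N_k / N. The `digamma` helper, the function name `vb_weights`, and the counts are assumptions made for this illustration.

```python
import math


def digamma(x):
    """Digamma function for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def vb_weights(counts, alpha0):
    """exp(E[log pi_k]) under Dir(alpha) with alpha_k = alpha0 + N_k."""
    alpha = [alpha0 + n for n in counts]
    log_norm = digamma(sum(alpha))
    return [math.exp(digamma(a) - log_norm) for a in alpha]


counts = [50.0, 5.0, 1.0]                 # hypothetical effective counts N_k
total = sum(counts)
pi_tilde = vb_weights(counts, alpha0=0.001)
# Ratio of the VB weight to the plain proportion N_k / N for each cluster.
shrink = [pt / (n / total) for pt, n in zip(pi_tilde, counts)]
```

The entries of `shrink` decrease as the cluster gets smaller: the largest cluster is almost unaffected, while the cluster with a single effective member loses a large fraction of its weight. Note that the values in `pi_tilde` do not sum to one, since they are exponentiated expected logs rather than expected probabilities.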

This automatic pruning method is demonstrated in Figure 21.8. We fit a mixture of 6 Gaussians 
to the Old Faithful dataset, but the data only really “needs” 2 clusters, so the rest get “killed off”. 


4. For more details, see (Liang et al. 2007).

In this example, we used $\alpha_0 = 0.001$; if we use a larger $\alpha_0$, we do not get a sparsity effect.
In Figure 21.9, we plot $q(\boldsymbol{\alpha}|\mathcal{D})$ at various iterations; we see that the unwanted components
get extinguished. This provides an efficient alternative to performing a discrete search over the 
number of clusters. 


Variational message passing and VIBES 


We have seen that mean field methods, at least of the fully-factorized variety, are all very similar: 
just compute each node's full conditional, and average out the neighbors. This is very similar 
to Gibbs sampling (Section 24.2), except the derivation of the equations is usually a bit more 
work. Fortunately it is possible to derive a general purpose set of update equations that work for 
any DGM for which all CPDs are in the exponential family, and for which all parent nodes have 
conjugate distributions (Ghahramani and Beal 2001). (See (Wand et al. 2011) for a recent extension 
to handle non-conjugate priors.) One can then sweep over the graph, updating nodes one at a 
time, in a manner similar to Gibbs sampling. This is known as variational message passing or 
VMP (Winn and Bishop 2005), and has been implemented in the open-source program VIBES.⁵
This is a VB analog to BUGS, which is a popular generic program for Gibbs sampling discussed
in Section 24.2.6.

VMP/mean field is best-suited to inference where one or more of the hidden nodes are
continuous (e.g., when performing “Bayesian learning”). For models where all the hidden nodes 
are discrete, more accurate approximate inference algorithms can be used, as we discuss in 
Chapter 22. 


Local variational bounds * 


So far, we have been focusing on mean field inference, which is a form of variational inference 
based on minimizing KL (q||p), where q is the approximate posterior, assumed to be factorized, 
and p is the exact (but unnormalized) posterior. However, there is another kind of variational 
inference, where we replace a specific term in the joint distribution with a simpler function, to 
simplify computation of the posterior. Such an approach is sometimes called a local variational 
approximation, since we are only modifying one piece of the model, unlike mean field, which 
is a global approximation. In this section, we study several examples of this method. 


Motivating applications 
Before we explain how to derive local variational bounds, we give some examples of where this 
is useful. 


Variational logistic regression 


Consider the problem of how to approximate the parameter posterior for a multiclass logistic
regression model under a Gaussian prior. One approach is to use a Gaussian (Laplace)
approximation, as discussed in Section 8.4.3. However, a variational approach can produce a more
accurate approximation to the posterior, since it has tunable parameters. Another advantage is
that the variational approach monotonically optimizes a lower bound on the likelihood of the
data, as we will see.


5. Available at http://vibes.sourceforge.net/.

To see why we need a bound, note that the likelihood can be written as follows: 


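$$p(\mathbf{y}|\mathbf{X},\mathbf{w}) = \prod_{i=1}^{N}\exp\left[\mathbf{y}_i^T\boldsymbol{\eta}_i - \mathrm{lse}(\boldsymbol{\eta}_i)\right] \tag{21.156}$$


where $\boldsymbol{\eta}_i = \mathbf{X}_i\mathbf{w} = [\mathbf{x}_i^T\mathbf{w}_1,\ldots,\mathbf{x}_i^T\mathbf{w}_M]$, where $M = C - 1$ (since we set $\mathbf{w}_C = \mathbf{0}$ for
identifiability), and where we define the log-sum-exp or lse function as follows:


$$\mathrm{lse}(\boldsymbol{\eta}_i) \triangleq \log\left(1 + \sum_{m=1}^{M}e^{\eta_{im}}\right) \tag{21.157}$$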


The main problem is that this likelihood is not conjugate to the Gaussian prior. Below we discuss 
how to compute “Gaussian-like” lower bounds to this likelihood, which give rise to approximate 
Gaussian posteriors. 


Multi-task learning 


One important application of Bayesian inference for logistic regression is where we have multiple 
related classifiers we want to fit. In this case, we want to share information between the 
parameters for each classifier; this requires that we maintain a posterior distribution over the
parameters, so we have a measure of confidence as well as an estimate of the value. We can 
embed the above variational method inside of a larger hierarchical model in order to perform 
such multi-task learning, as described in e.g., (Braun and McAuliffe 2010). 


Discrete factor analysis 


Another situation where variational bounds are useful arises when we fit a factor analysis 
model to discrete data. This model is just like multinomial logistic regression, except the input 
variables are hidden factors. We need to perform inference on the hidden variables as well as 
the regression weights. For simplicity, we might perform point estimation of the weights, and 
just integrate out the hidden variables. We can do this using variational EM, where we use the 
variational bound in the E step. See Section 12.4 for details. 


Correlated topic model 


A topic model is a latent variable model for text documents and other forms of discrete data; see 
Section 27.3 for details. Often we assume the distribution over topics has a Dirichlet prior, but 
a more powerful model, known as the correlated topic model, uses a Gaussian prior, which can 
model correlations more easily (see Section 27.4.1 for details). Unfortunately, this also involves 
the lse function. However, we can use our variational bounds in the context of a variational EM 
algorithm, as we will see later.


Bohning’s quadratic bound to the log-sum-exp function


All of the above examples require multiplying a Gaussian prior by a multinomial
likelihood; this is difficult because of the log-sum-exp (lse) term. In this section, we derive
a “Gaussian-like” lower bound on this likelihood.

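Consider a Taylor series expansion of the lse function around $\boldsymbol{\psi}_i \in \mathbb{R}^M$:


$$\mathrm{lse}(\boldsymbol{\eta}_i) = \mathrm{lse}(\boldsymbol{\psi}_i) + (\boldsymbol{\eta}_i-\boldsymbol{\psi}_i)^T\mathbf{g}(\boldsymbol{\psi}_i) + \frac{1}{2}(\boldsymbol{\eta}_i-\boldsymbol{\psi}_i)^T\mathbf{H}(\boldsymbol{\psi}_i)(\boldsymbol{\eta}_i-\boldsymbol{\psi}_i) \tag{21.158}$$
$$\mathbf{g}(\boldsymbol{\psi}_i) = \exp[\boldsymbol{\psi}_i - \mathrm{lse}(\boldsymbol{\psi}_i)] = \mathcal{S}(\boldsymbol{\psi}_i) \tag{21.159}$$
$$\mathbf{H}(\boldsymbol{\psi}_i) = \mathrm{diag}(\mathbf{g}(\boldsymbol{\psi}_i)) - \mathbf{g}(\boldsymbol{\psi}_i)\mathbf{g}(\boldsymbol{\psi}_i)^T \tag{21.160}$$


where $\mathbf{g}$ and $\mathbf{H}$ are the gradient and Hessian of lse, and $\boldsymbol{\psi}_i \in \mathbb{R}^M$ is chosen such that equality
holds. An upper bound to lse can be found by replacing the Hessian matrix $\mathbf{H}(\boldsymbol{\psi}_i)$ with a
matrix $\mathbf{A}_i$ that dominates it, i.e., such that $\mathbf{A}_i - \mathbf{H}(\boldsymbol{\psi}_i)$ is positive semidefinite.
(Bohning 1992) showed that this can be achieved if we use
the matrix $\mathbf{A}_i = \frac{1}{2}\left[\mathbf{I}_M - \frac{1}{M+1}\mathbf{1}_M\mathbf{1}_M^T\right]$. (Recall that $M + 1 = C$ is the number of classes.)
Note that $\mathbf{A}_i$ is independent of $\boldsymbol{\psi}_i$; however, we still write it as $\mathbf{A}_i$ (rather than dropping the
i subscript), since other bounds that we consider below will have a data-dependent curvature
term. The upper bound on lse therefore becomes


$$\mathrm{lse}(\boldsymbol{\eta}_i) \le \frac{1}{2}\boldsymbol{\eta}_i^T\mathbf{A}_i\boldsymbol{\eta}_i - \mathbf{b}_i^T\boldsymbol{\eta}_i + c_i \tag{21.161}$$
$$\mathbf{A}_i = \frac{1}{2}\left[\mathbf{I}_M - \frac{1}{M+1}\mathbf{1}_M\mathbf{1}_M^T\right] \tag{21.162}$$
$$\mathbf{b}_i = \mathbf{A}_i\boldsymbol{\psi}_i - \mathbf{g}(\boldsymbol{\psi}_i) \tag{21.163}$$
$$c_i = \frac{1}{2}\boldsymbol{\psi}_i^T\mathbf{A}_i\boldsymbol{\psi}_i - \mathbf{g}(\boldsymbol{\psi}_i)^T\boldsymbol{\psi}_i + \mathrm{lse}(\boldsymbol{\psi}_i) \tag{21.164}$$


where $\boldsymbol{\psi}_i \in \mathbb{R}^M$ is a vector of variational parameters.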
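The quadratic upper bound on lse can be checked numerically with a small sketch. It uses the fact that multiplying a vector v by the Bohning matrix reduces to 0.5 * (v - sum(v) / (M + 1)), so no matrix library is needed. The function names and the test points below are assumptions made for illustration.

```python
import math


def lse(eta):
    """log(1 + sum_m exp(eta_m)), computed stably."""
    mx = max(0.0, max(eta))
    return mx + math.log(math.exp(-mx) + sum(math.exp(e - mx) for e in eta))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def apply_A(v):
    """Multiply by A = 0.5 * (I - 11^T / (M + 1)) without forming the matrix."""
    s = sum(v)
    return [0.5 * (vi - s / (len(v) + 1)) for vi in v]


def bohning_bound(eta, psi):
    """Quadratic upper bound on lse(eta), expanded around psi."""
    l = lse(psi)
    g = [math.exp(p - l) for p in psi]                 # gradient of lse at psi
    a_psi = apply_A(psi)
    b = [ap - gi for ap, gi in zip(a_psi, g)]          # b = A psi - g(psi)
    c = 0.5 * dot(psi, a_psi) - dot(g, psi) + l        # c as in the bound's constant term
    return 0.5 * dot(eta, apply_A(eta)) - dot(b, eta) + c


psi = [0.5, -1.0]                                       # hypothetical expansion point
points = [[0.0, 0.0], [3.0, -2.0], [-4.0, 5.0], [10.0, 10.0], psi]
gaps = [bohning_bound(eta, psi) - lse(eta) for eta in points]
```

Every entry of `gaps` should be non-negative, and the gap at `eta = psi` should vanish up to rounding, reflecting that the bound is tight at the expansion point.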
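We can use the above result to get the following lower bound on the softmax likelihood:


$$\log p(\mathbf{y}_i|\mathbf{x}_i,\mathbf{w}) \ge \mathbf{y}_i^T\mathbf{X}_i\mathbf{w} - \frac{1}{2}\mathbf{w}^T\mathbf{X}_i^T\mathbf{A}_i\mathbf{X}_i\mathbf{w} + \mathbf{b}_i^T\mathbf{X}_i\mathbf{w} - c_i \tag{21.165}$$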


To simplify notation, define the pseudo-measurement


$$\tilde{\mathbf{y}}_i \triangleq \mathbf{A}_i^{-1}(\mathbf{b}_i + \mathbf{y}_i) \tag{21.166}$$


Then we can get a “Gaussianized” version of the observation model:
$$p(\mathbf{y}_i|\mathbf{x}_i,\mathbf{w}) \ge f(\mathbf{x}_i,\boldsymbol{\psi}_i)\,\mathcal{N}(\tilde{\mathbf{y}}_i|\mathbf{X}_i\mathbf{w},\mathbf{A}_i^{-1}) \tag{21.167}$$


where $f(\mathbf{x}_i,\boldsymbol{\psi}_i)$ is some function that does not depend on $\mathbf{w}$. Given this, it is easy to compute
the posterior $q(\mathbf{w}) = \mathcal{N}(\mathbf{m}_N,\mathbf{V}_N)$, using Bayes rule for Gaussians. Below we will explain how
to update the variational parameters $\boldsymbol{\psi}_i$.


Applying Bohning’s bound to multinomial logistic regression


Let us see how to apply this bound to multinomial logistic regression. From Equation 21.13, we
can define the goal of variational inference as maximizing


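$$L(q) = -\mathrm{KL}(q(\mathbf{w})\|p(\mathbf{w})) + E_q\left[\sum_{i=1}^{N}\log p(\mathbf{y}_i|\mathbf{x}_i,\mathbf{w})\right] \tag{21.168}$$
$$= -\mathrm{KL}(q(\mathbf{w})\|p(\mathbf{w})) + E_q\left[\sum_{i=1}^{N}\mathbf{y}_i^T\boldsymbol{\eta}_i - \mathrm{lse}(\boldsymbol{\eta}_i)\right] \tag{21.169}$$
$$= -\mathrm{KL}(q(\mathbf{w})\|p(\mathbf{w})) + \sum_{i=1}^{N}\mathbf{y}_i^T E_q[\boldsymbol{\eta}_i] - \sum_{i=1}^{N}E_q[\mathrm{lse}(\boldsymbol{\eta}_i)] \tag{21.170}$$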


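where $q(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N,\mathbf{V}_N)$ is the approximate posterior. The first term is just the KL
divergence between two Gaussians, which is given by


$$\mathrm{KL}(\mathcal{N}(\mathbf{m}_N,\mathbf{V}_N)\|\mathcal{N}(\mathbf{m}_0,\mathbf{V}_0)) = \frac{1}{2}\left[\mathrm{tr}(\mathbf{V}_N\mathbf{V}_0^{-1}) - \log|\mathbf{V}_N\mathbf{V}_0^{-1}| + (\mathbf{m}_N-\mathbf{m}_0)^T\mathbf{V}_0^{-1}(\mathbf{m}_N-\mathbf{m}_0) - DM\right] \tag{21.171}$$


where DM is the dimensionality of the Gaussian, and we assume a prior of the form $p(\mathbf{w}) =
\mathcal{N}(\mathbf{m}_0,\mathbf{V}_0)$, where typically $\mathbf{m}_0 = \mathbf{0}_{DM}$, and $\mathbf{V}_0$ is block diagonal. The second term is simply


$$\sum_{i=1}^{N}\mathbf{y}_i^T E_q[\boldsymbol{\eta}_i] = \sum_{i=1}^{N}\mathbf{y}_i^T\tilde{\mathbf{m}}_i \tag{21.172}$$


where $\tilde{\mathbf{m}}_i \triangleq \mathbf{X}_i\mathbf{m}_N$. The final term can be lower bounded by taking expectations of our
quadratic upper bound on lse as follows:


$$-\sum_{i=1}^{N}E_q[\mathrm{lse}(\boldsymbol{\eta}_i)] \ge \sum_{i=1}^{N}\left[-\frac{1}{2}\mathrm{tr}(\mathbf{A}_i\tilde{\mathbf{V}}_i) - \frac{1}{2}\tilde{\mathbf{m}}_i^T\mathbf{A}_i\tilde{\mathbf{m}}_i + \mathbf{b}_i^T\tilde{\mathbf{m}}_i - c_i\right] \tag{21.173}$$


where $\tilde{\mathbf{V}}_i \triangleq \mathbf{X}_i\mathbf{V}_N\mathbf{X}_i^T$. Putting it all together, we have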


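$$L_{QJ}(q) \triangleq -\frac{1}{2}\left[\mathrm{tr}(\mathbf{V}_N\mathbf{V}_0^{-1}) - \log|\mathbf{V}_N\mathbf{V}_0^{-1}| + (\mathbf{m}_N-\mathbf{m}_0)^T\mathbf{V}_0^{-1}(\mathbf{m}_N-\mathbf{m}_0)\right] + \frac{1}{2}DM$$
$$\quad + \sum_{i=1}^{N}\left[\mathbf{y}_i^T\tilde{\mathbf{m}}_i - \frac{1}{2}\mathrm{tr}(\mathbf{A}_i\tilde{\mathbf{V}}_i) - \frac{1}{2}\tilde{\mathbf{m}}_i^T\mathbf{A}_i\tilde{\mathbf{m}}_i + \mathbf{b}_i^T\tilde{\mathbf{m}}_i - c_i\right] \tag{21.174}$$
This lower bound combines Jensen’s inequality (as in mean field inference), plus the quadratic
lower bound due to the lse term, so we write it as $L_{QJ}$.
We will use coordinate ascent to optimize this lower bound. That is, we update the variational
posterior parameters $\mathbf{V}_N$ and $\mathbf{m}_N$, and then the variational likelihood parameters $\boldsymbol{\psi}_i$. We leave
the detailed derivation as an exercise, and just state the results. We have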


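$$\mathbf{V}_N = \left(\mathbf{V}_0^{-1} + \sum_{i=1}^{N}\mathbf{X}_i^T\mathbf{A}_i\mathbf{X}_i\right)^{-1} \tag{21.175}$$
$$\mathbf{m}_N = \mathbf{V}_N\left(\mathbf{V}_0^{-1}\mathbf{m}_0 + \sum_{i=1}^{N}\mathbf{X}_i^T(\mathbf{y}_i + \mathbf{b}_i)\right) \tag{21.176}$$
$$\boldsymbol{\psi}_i = \tilde{\mathbf{m}}_i = \mathbf{X}_i\mathbf{m}_N \tag{21.177}$$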

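We can exploit the fact that $\mathbf{A}_i$ is a constant matrix, plus the fact that $\mathbf{X}_i$ has block structure,
to simplify the first two terms as follows:


$$\mathbf{V}_N = \left(\mathbf{V}_0^{-1} + \mathbf{A}\otimes\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T\right)^{-1} \tag{21.178}$$
$$\mathbf{m}_N = \mathbf{V}_N\left(\mathbf{V}_0^{-1}\mathbf{m}_0 + \sum_{i=1}^{N}(\mathbf{y}_i + \mathbf{b}_i)\otimes\mathbf{x}_i\right) \tag{21.179}$$


where $\otimes$ denotes the Kronecker product. See Algorithm 21.1 for some pseudocode, and
http://www.cs.ubc.ca/~emtiyaz/software/catLGM.html for some Matlab code.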


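Algorithm 21.1: Variational inference for multi-class logistic regression using Bohning’s
bound

Input: $y_i \in \{1,\ldots,C\}$, $\mathbf{x}_i \in \mathbb{R}^D$, $i = 1:N$, prior $\mathbf{m}_0$, $\mathbf{V}_0$ ;
Define $M := C - 1$; dummy encode $\mathbf{y}_i \in \{0,1\}^M$; define $\mathbf{X}_i = \mathrm{blockdiag}(\mathbf{x}_i^T)$ ;
Define $\mathbf{y} := [\mathbf{y}_1;\ldots;\mathbf{y}_N]$, $\mathbf{X} := [\mathbf{X}_1;\ldots;\mathbf{X}_N]$ and $\mathbf{A} := \frac{1}{2}\left[\mathbf{I}_M - \frac{1}{M+1}\mathbf{1}_M\mathbf{1}_M^T\right]$ ;
$\mathbf{V}_N := \left(\mathbf{V}_0^{-1} + \sum_{i=1}^{N}\mathbf{X}_i^T\mathbf{A}\mathbf{X}_i\right)^{-1}$ ;
Initialize $\mathbf{m}_N := \mathbf{m}_0$ ;
repeat
    $\boldsymbol{\psi} := \mathbf{X}\mathbf{m}_N$ ;
    $\boldsymbol{\Psi} := \mathrm{reshape}(\boldsymbol{\psi}, M, N)$ ;
    $\mathbf{G} := \exp(\boldsymbol{\Psi} - \mathrm{lse}(\boldsymbol{\Psi}))$ ;
    $\mathbf{B} := \mathbf{A}\boldsymbol{\Psi} - \mathbf{G}$ ;
    $\mathbf{b} := \mathrm{vec}(\mathbf{B})$ ;
    $\mathbf{m}_N := \mathbf{V}_N\left(\mathbf{V}_0^{-1}\mathbf{m}_0 + \mathbf{X}^T(\mathbf{y} + \mathbf{b})\right)$ ;
    Compute the lower bound $L_{QJ}$ using Equation 21.174 ;
until converged ;
Return $\mathbf{m}_N$ and $\mathbf{V}_N$ ;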


Bounds for the sigmoid function 


In many models, we just have binary data. In this case, we have $y_i \in \{0,1\}$, $M = 1$ and
$\eta_i = \mathbf{w}^T\mathbf{x}_i$ where $\mathbf{w} \in \mathbb{R}^D$ is a weight vector (not matrix). In this case, the Bohning bound
becomes
$$\log(1 + e^{\eta}) \le \frac{1}{2}a\eta^2 - b\eta + c \tag{21.180}$$
$$a = \frac{1}{4} \tag{21.181}$$
$$b = a\psi - (1 + e^{-\psi})^{-1} \tag{21.182}$$
$$c = \frac{1}{2}a\psi^2 - (1 + e^{-\psi})^{-1}\psi + \log(1 + e^{\psi}) \tag{21.183}$$


Figure 21.10 Quadratic lower bounds on the sigmoid (logistic) function. In solid red, we plot sigm(x) vs
x. In dotted blue, we plot the lower bound $L(x,\xi)$ vs x for $\xi = 2.5$. (a) Bohning bound. This is tight at
$-\xi = 2.5$. (b) JJ bound. This is tight at $\xi = \pm 2.5$. Figure generated by sigmoidLowerBounds.

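It is possible to derive an alternative quadratic bound for this case, as shown in (Jaakkola and
Jordan 1996b, 2000). This has the following form


$$\log(1 + e^{\eta}) \le \lambda(\xi)(\eta^2 - \xi^2) + \frac{1}{2}(\eta - \xi) + \log(1 + e^{\xi}) \tag{21.184}$$
$$\lambda(\xi) \triangleq \frac{1}{4\xi}\tanh(\xi/2) = \frac{1}{2\xi}\left[\mathrm{sigm}(\xi) - \frac{1}{2}\right] \tag{21.185}$$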

We shall refer to this as the JJ bound, after its inventors, (Jaakkola and Jordan 1996b, 2000).
To facilitate comparison with Bohning’s bound, let us rewrite the JJ bound as a quadratic form
as follows


$$\log(1 + e^{\eta}) \le \frac{1}{2}a(\xi)\eta^2 - b(\xi)\eta + c(\xi) \tag{21.186}$$
$$a(\xi) = 2\lambda(\xi) \tag{21.187}$$
$$b(\xi) = -\frac{1}{2} \tag{21.188}$$
$$c(\xi) = -\lambda(\xi)\xi^2 - \frac{1}{2}\xi + \log(1 + e^{\xi}) \tag{21.189}$$


The JJ bound has an adaptive curvature term, since a depends on $\xi$. In addition, it is tight at
two points, as is evident from Figure 21.10(b). By contrast, the Bohning bound is a constant
curvature bound, and is only tight at one point, as is evident from Figure 21.10(a).
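The two scalar bounds on log(1 + e^eta) are easy to compare numerically. The following sketch evaluates both on a grid and checks where each one touches the function; the function names, the grid, and the choice of expansion points (psi = -2.5 and xi = 2.5, mirroring the figure) are assumptions made for illustration.

```python
import math


def softplus(t):
    """log(1 + exp(t)), computed stably."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def sigm(t):
    return 1.0 / (1.0 + math.exp(-t))


def bohning_1d(eta, psi):
    """Constant-curvature quadratic bound with a = 1/4, expanded at psi."""
    a = 0.25
    b = a * psi - sigm(psi)
    c = 0.5 * a * psi ** 2 - sigm(psi) * psi + softplus(psi)
    return 0.5 * a * eta ** 2 - b * eta + c


def jj_lambda(xi):
    """lambda(xi) = tanh(xi/2) / (4 xi), with its limit 1/8 at xi = 0."""
    return 0.125 if abs(xi) < 1e-8 else math.tanh(xi / 2.0) / (4.0 * xi)


def jj_1d(eta, xi):
    """Jaakkola-Jordan bound with variational parameter xi."""
    return jj_lambda(xi) * (eta ** 2 - xi ** 2) + 0.5 * (eta - xi) + softplus(xi)


grid = [i / 10.0 for i in range(-60, 61)]
bohning_gap = [bohning_1d(e, -2.5) - softplus(e) for e in grid]
jj_gap = [jj_1d(e, 2.5) - softplus(e) for e in grid]
```

Both gap lists should be non-negative everywhere on the grid. The JJ gap vanishes at both eta = 2.5 and eta = -2.5, while the Bohning gap vanishes only at its expansion point.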




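If we wish to use the JJ bound for binary logistic regression, we can make some small
modifications to Algorithm 21.1. First, we use the new definitions for $a_i$, $b_i$ and $c_i$. The fact that
$a_i$ is not constant when using the JJ bound, unlike when using the Bohning bound, means we
cannot compute $\mathbf{V}_N$ outside of the main loop, making the method a constant factor slower.
Next we note that $\mathbf{X}_i = \mathbf{x}_i^T$, so the updates for the posterior become


$$\mathbf{V}_N^{-1} = \mathbf{V}_0^{-1} + 2\sum_{i=1}^{N}\lambda(\xi_i)\mathbf{x}_i\mathbf{x}_i^T \tag{21.190}$$
$$\mathbf{m}_N = \mathbf{V}_N\left(\mathbf{V}_0^{-1}\mathbf{m}_0 + \sum_{i=1}^{N}\left(y_i - \frac{1}{2}\right)\mathbf{x}_i\right) \tag{21.191}$$


Finally, to compute the update for $\xi_i$, we isolate the terms in $L_{QJ}$ that depend on $\xi_i$ to get


$$L(\boldsymbol{\xi}) = \sum_{i=1}^{N}\left\{\ln\mathrm{sigm}(\xi_i) - \xi_i/2 - \lambda(\xi_i)\left(\mathbf{x}_i^T E_q[\mathbf{w}\mathbf{w}^T]\mathbf{x}_i - \xi_i^2\right)\right\} + \mathrm{const} \tag{21.192}$$


Optimizing this wrt $\xi_i$ gives the equation


$$0 = \lambda'(\xi_i)\left(\mathbf{x}_i^T E_q[\mathbf{w}\mathbf{w}^T]\mathbf{x}_i - \xi_i^2\right) \tag{21.193}$$


Now $\lambda'(\xi_i)$ is monotonic for $\xi_i \ge 0$, and we do not need to consider negative values of $\xi_i$ by
symmetry of the bound around $\xi_i = 0$ (see Figure 21.10). Hence the only way to make the above
expression 0 is if we have $\left(\mathbf{x}_i^T E[\mathbf{w}\mathbf{w}^T]\mathbf{x}_i - \xi_i^2\right) = 0$. Hence the update becomes


$$(\xi_i^{new})^2 = \mathbf{x}_i^T(\mathbf{V}_N + \mathbf{m}_N\mathbf{m}_N^T)\mathbf{x}_i \tag{21.194}$$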
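The JJ-based coordinate ascent can be sketched for a single scalar feature, so that the posterior over the weight is a univariate Gaussian with mean m_N and variance V_N. The mean update below uses the sum of (y_i - 1/2) x_i, which is what the general mean update gives when b = -1/2 is substituted. The function names, prior values, iteration count, and toy data are assumptions made for this illustration.

```python
import math


def jj_lambda(xi):
    """lambda(xi) = tanh(xi/2) / (4 xi), with its limit 1/8 at xi = 0."""
    return 0.125 if abs(xi) < 1e-8 else math.tanh(xi / 2.0) / (4.0 * xi)


def vb_logreg_jj_1d(x, y, m0=0.0, v0=10.0, n_iter=30):
    """Variational binary logistic regression with one scalar weight (JJ bound)."""
    xi = [1.0] * len(x)                       # initial variational parameters (assumed)
    m_n, v_n = m0, v0
    for _ in range(n_iter):
        # Posterior precision: prior precision plus 2 * sum_i lambda(xi_i) x_i^2.
        prec = 1.0 / v0 + 2.0 * sum(jj_lambda(s) * xv * xv for s, xv in zip(xi, x))
        v_n = 1.0 / prec
        # Posterior mean: uses (y_i - 1/2) x_i because b = -1/2 for the JJ bound.
        m_n = v_n * (m0 / v0 + sum((yv - 0.5) * xv for xv, yv in zip(x, y)))
        # xi_i^2 = x_i^2 * E[w^2], where E[w^2] = V_N + m_N^2.
        xi = [math.sqrt(xv * xv * (v_n + m_n * m_n)) for xv in x]
    return m_n, v_n


# Toy data: labels mostly follow the sign of x, with two deliberately flipped labels.
x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 1.0, -1.0]
y = [0, 0, 0, 1, 1, 1, 0, 1]
m_n, v_n = vb_logreg_jj_1d(x, y)
```

Unlike the Bohning version, the posterior variance has to be recomputed inside the loop, because the curvature terms lambda(xi_i) change whenever the xi_i are updated.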


Although the JJ bound is tighter than the Bohning bound, sometimes it is not tight enough 
in order to estimate the posterior covariance accurately. A more accurate approach, which uses 
a piecewise quadratic upper bound to lse, is described in (Marlin et al. 2011). By increasing the 
number of pieces, the bound can be made arbitrarily tight. 


Other bounds and approximations to the log-sum-exp function * 


There are several other bounds and approximations to the multiclass lse function which we 
can use, which we briefly summarize below. Note, however, that all of these require numerical 
optimization methods to compute $\mathbf{m}_N$ and $\mathbf{V}_N$, making them more complicated to implement.


Product of sigmoids 


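The approach in (Bouchard 2007) exploits the fact that


$$\log\left(\sum_{k=1}^{K}e^{\eta_k}\right) \le \alpha + \sum_{k=1}^{K}\log(1 + e^{\eta_k-\alpha}) \tag{21.195}$$


It then applies the JJ bound to the term on the right.


Jensen’s inequality


The approach in (Blei and Lafferty 2006a, 2007) uses Jensen’s inequality as follows:


$$E_q[\mathrm{lse}(\boldsymbol{\eta}_i)] = E_q\left[\log\left(1 + \sum_{c=1}^{M}\exp(\mathbf{x}_i^T\mathbf{w}_c)\right)\right] \tag{21.196}$$
$$\le \log\left(1 + \sum_{c=1}^{M}E_q\left[\exp(\mathbf{x}_i^T\mathbf{w}_c)\right]\right) \tag{21.197}$$
$$= \log\left(1 + \sum_{c=1}^{M}\exp\left(\mathbf{x}_i^T\mathbf{m}_{N,c} + \frac{1}{2}\mathbf{x}_i^T\mathbf{V}_{N,cc}\mathbf{x}_i\right)\right) \tag{21.198}$$


where the last term follows from the mean of a log-normal distribution, which is $e^{\mu+\sigma^2/2}$.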


Multivariate delta method 


The approach in (Ahmed and Xing 2007; Braun and McAuliffe 2010) uses the multivariate delta 
method, which is a way to approximate moments of a function using a Taylor series expansion. 
In more detail, let f (w) be the function of interest. Using a second-order approximation around 
m we have 


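$$f(\mathbf{w}) \approx f(\mathbf{m}) + (\mathbf{w}-\mathbf{m})^T\mathbf{g} + \frac{1}{2}(\mathbf{w}-\mathbf{m})^T\mathbf{H}(\mathbf{w}-\mathbf{m}) \tag{21.199}$$


where $\mathbf{g}$ and $\mathbf{H}$ are the gradient and Hessian evaluated at $\mathbf{m}$. If $q(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m},\mathbf{V})$, we have


$$E_q[f(\mathbf{w})] \approx f(\mathbf{m}) + \frac{1}{2}\mathrm{tr}[\mathbf{H}\mathbf{V}] \tag{21.200}$$


If we use $f(\mathbf{w}) = \mathrm{lse}(\mathbf{X}_i\mathbf{w})$, we get


$$E_q[\mathrm{lse}(\mathbf{X}_i\mathbf{w})] \approx \mathrm{lse}(\mathbf{X}_i\mathbf{m}) + \frac{1}{2}\mathrm{tr}[\mathbf{X}_i^T\mathbf{H}\mathbf{X}_i\mathbf{V}] \tag{21.201}$$


where $\mathbf{g}$ and $\mathbf{H}$ for the lse function are defined in Equations 21.159 and 21.160.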
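To get a feel for how the Jensen bound and the delta-method approximation behave, the sketch below treats the scalar case M = 1 with x = 1, so that lse reduces to log(1 + e^w) with w drawn from a univariate Gaussian. The reference value is obtained by a simple trapezoid rule; the function names, grid settings, and the chosen mean and variance are assumptions made for illustration.

```python
import math


def softplus(t):
    """log(1 + exp(t)), computed stably."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def expected_softplus(m, v, n=4001, width=10.0):
    """E[log(1 + e^w)] for w ~ N(m, v), by the trapezoid rule on m +/- width * sd."""
    sd = math.sqrt(v)
    lo = m - width * sd
    h = 2.0 * width * sd / (n - 1)
    total = 0.0
    for i in range(n):
        w = lo + i * h
        dens = math.exp(-0.5 * ((w - m) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, n - 1) else 1.0      # half weight at the end points
        total += weight * softplus(w) * dens * h
    return total


def jensen_bound(m, v):
    """Upper bound log(1 + E[e^w]) using the log-normal mean exp(m + v/2)."""
    return math.log1p(math.exp(m + 0.5 * v))


def delta_approx(m, v):
    """Second-order delta method: f(m) + 0.5 * f''(m) * v, with f'' = s(1 - s)."""
    s = 1.0 / (1.0 + math.exp(-m))
    return softplus(m) + 0.5 * s * (1.0 - s) * v


m, v = 0.5, 1.0                                        # hypothetical posterior moments
reference = expected_softplus(m, v)
upper = jensen_bound(m, v)
approx = delta_approx(m, v)
```

The Jensen value is a guaranteed upper bound on the expectation, whereas the delta-method value is only an approximation and may fall on either side of the reference.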


Variational inference based on upper bounds 


So far, we have been concentrating on lower bounds. However, sometimes we need to use an 
upper bound. For example, (Saul et al. 1996) derives a mean field algorithm for sigmoid belief 
nets, which are DGMs in which each CPD is a logistic regression function (Neal 1992). Unlike the 
case of Ising models, the resulting MRF is not pairwise, but contains higher order interactions. 
This makes the standard mean field updates intractable. In particular, they turn out to involve 
computing an expression which requires evaluating 


$$E\left[\log\left(1 + e^{-\sum_{j\in\mathrm{pa}(i)}w_{ij}x_j}\right)\right] = E\left[-\log\mathrm{sigm}(\mathbf{w}_i^T\mathbf{x}_{\mathrm{pa}(i)})\right] \tag{21.202}$$


(Notice the minus sign in front.) (Saul et al. 1996) show how to derive an upper bound on the
sigmoid function so as to make this update tractable, resulting in a monotonically convergent
inference procedure.


Exercises


Exercise 21.1 Laplace approximation to $p(\mu,\log\sigma|\mathcal{D})$ for a univariate Gaussian


Compute a Laplace approximation of $p(\mu,\log\sigma|\mathcal{D})$ for a Gaussian, using an uninformative prior $p(\mu,\log\sigma) \propto 1$.


Exercise 21.2 Laplace approximation to normal-gamma 


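Consider estimating $\mu$ and $\ell = \log\sigma$ for a Gaussian using an uninformative normal-Gamma prior. The log
posterior is


$$\log p(\mu,\ell|\mathcal{D}) = -n\log\sigma - \frac{1}{2\sigma^2}\left[ns^2 + n(\bar{y}-\mu)^2\right] \tag{21.203}$$


a. Show that the first derivatives are


$$\frac{\partial}{\partial\mu}\log p(\mu,\ell|\mathcal{D}) = \frac{n(\bar{y}-\mu)}{\sigma^2} \tag{21.204}$$
$$\frac{\partial}{\partial\ell}\log p(\mu,\ell|\mathcal{D}) = -n + \frac{ns^2 + n(\bar{y}-\mu)^2}{\sigma^2} \tag{21.205}$$
b. Show that the Hessian matrix is given by
$$\mathbf{H} = \begin{pmatrix}\frac{\partial^2}{\partial\mu^2}\log p(\mu,\ell|\mathcal{D}) & \frac{\partial^2}{\partial\mu\,\partial\ell}\log p(\mu,\ell|\mathcal{D})\\ \frac{\partial^2}{\partial\ell\,\partial\mu}\log p(\mu,\ell|\mathcal{D}) & \frac{\partial^2}{\partial\ell^2}\log p(\mu,\ell|\mathcal{D})\end{pmatrix} \tag{21.206}$$
$$= \begin{pmatrix}-\frac{n}{\sigma^2} & -\frac{2n(\bar{y}-\mu)}{\sigma^2}\\ -\frac{2n(\bar{y}-\mu)}{\sigma^2} & -\frac{2}{\sigma^2}\left(ns^2 + n(\bar{y}-\mu)^2\right)\end{pmatrix} \tag{21.207}$$


c. Use this to derive a Laplace approximation to the posterior $p(\mu,\ell|\mathcal{D})$.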


Exercise 21.3 Variational lower bound for VB for univariate Gaussian 
Fill in the details of the derivation in Section 21.5.1.6. 


Exercise 21.4 Variational lower bound for VB for GMMs 
Consider VBEM for GMMs as in Section 21.6.1.4. Show that the lower bound has the following form 


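$$ \mathcal{L} = \mathbb{E}[\ln p(x|z, \mu, \Lambda)] + \mathbb{E}[\ln p(z|\pi)] + \mathbb{E}[\ln p(\pi)] + \mathbb{E}[\ln p(\mu, \Lambda)] - \mathbb{E}[\ln q(z)] - \mathbb{E}[\ln q(\pi)] - \mathbb{E}[\ln q(\mu, \Lambda)] \tag{21.208} $$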
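where 


$$ \mathbb{E}[\ln p(x|z, \mu, \Lambda)] = \frac{1}{2} \sum_k N_k \left\{ \ln \tilde{\Lambda}_k - D \beta_k^{-1} - \nu_k \mathrm{tr}(S_k L_k) - \nu_k (\bar{x}_k - m_k)^T L_k (\bar{x}_k - m_k) - D \ln(2\pi) \right\} \tag{21.209} $$
$$ \mathbb{E}[\ln p(z|\pi)] = \sum_i \sum_k r_{ik} \ln \tilde{\pi}_k \tag{21.210} $$
$$ \mathbb{E}[\ln p(\pi)] = \ln C_{\mathrm{dir}}(\alpha_0) + (\alpha_0 - 1) \sum_k \ln \tilde{\pi}_k \tag{21.211} $$
$$ \mathbb{E}[\ln p(\mu, \Lambda)] = \frac{1}{2} \sum_k \left\{ D \ln(\beta_0 / 2\pi) + \ln \tilde{\Lambda}_k - \frac{D \beta_0}{\beta_k} - \beta_0 \nu_k (m_k - m_0)^T L_k (m_k - m_0) \right\} + K \ln C_{\mathrm{Wi}}(L_0, \nu_0) + \frac{\nu_0 - D - 1}{2} \sum_k \ln \tilde{\Lambda}_k - \frac{1}{2} \sum_k \nu_k \mathrm{tr}(L_0^{-1} L_k) \tag{21.212} $$
$$ \mathbb{E}[\ln q(z)] = \sum_i \sum_k r_{ik} \ln r_{ik} \tag{21.213} $$
$$ \mathbb{E}[\ln q(\pi)] = \sum_k (\alpha_k - 1) \ln \tilde{\pi}_k + \ln C_{\mathrm{dir}}(\alpha) \tag{21.214} $$
$$ \mathbb{E}[\ln q(\mu, \Lambda)] = \sum_k \left\{ \frac{1}{2} \ln \tilde{\Lambda}_k + \frac{D}{2} \ln\left(\frac{\beta_k}{2\pi}\right) - \frac{D}{2} - \mathbb{H}(q(\Lambda_k)) \right\} \tag{21.215} $$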


where the normalization constant for the Dirichlet and Wishart is given by 


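$$ C_{\mathrm{dir}}(\alpha) \triangleq \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \tag{21.216} $$
$$ C_{\mathrm{Wi}}(L, \nu) \triangleq |L|^{-\nu/2} \left( 2^{\nu D/2} \Gamma_D(\nu/2) \right)^{-1} \tag{21.217} $$
$$ \Gamma_D(\alpha) \triangleq \pi^{D(D-1)/4} \prod_{j=1}^{D} \Gamma(\alpha + (1 - j)/2) \tag{21.218} $$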


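where $\Gamma_D(\nu)$ is the multivariate Gamma function. Finally, the entropy of the Wishart is given by 


$$ \mathbb{H}(\mathrm{Wi}(L, \nu)) = -\ln C_{\mathrm{Wi}}(L, \nu) - \frac{\nu - D - 1}{2} \mathbb{E}[\ln |\Lambda|] + \frac{\nu D}{2} \tag{21.219} $$


where $\mathbb{E}[\ln |\Lambda|]$ is given in Equation 21.131. 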


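Exercise 21.5 Derivation of $\mathbb{E}[\log \pi_k]$ under a Dirichlet distribution 
Show that 


$$ \exp(\mathbb{E}[\log \pi_k]) = \frac{\exp(\Psi(\alpha_k))}{\exp(\Psi(\sum_{k'} \alpha_{k'}))} \tag{21.220} $$


where $\pi \sim \mathrm{Dir}(\alpha)$. 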


Exercise 21.6 Alternative derivation of the mean field updates for the Ising model 


Derive Equation 21.50 by directly optimizing the variational free energy one term at a time. 
Exercise 21.7 Forwards vs reverse KL divergence 


(Source: Exercise 33.7 of (MacKay 2003).) Consider a factored approximation q(x, y) = q(x)q(y) to a joint 
distribution p(x, y). Show that to minimize the forwards KL KL (p||q) we should set q(x) = p(x) and 
q(y) = p(y), i.e., the optimal approximation is a product of marginals. 


Now consider the following joint distribution, where the rows represent y and the columns x. 


      1     2     3     4
1 |  1/8   1/8    0     0
2 |  1/8   1/8    0     0
3 |   0     0    1/4    0
4 |   0     0     0    1/4


Show that the reverse KL KL (q||p) for this p has three distinct minima. Identify those minima and 
evaluate KL (q||p) at each of them. What is the value of KL (q||p) if we set q(x, y) = p(x)p(y)? 


Exercise 21.8 Derivation of the structured mean field updates for FHMM 
Derive the updates in Section 21.4.1. 


Exercise 21.9 Variational EM for binary FA with sigmoid link 
Consider the binary FA model: 


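$$ p(x_i|z_i, \theta) = \prod_{j=1}^{D} \mathrm{Ber}(x_{ij}|\mathrm{sigm}(w_j^T z_i + \beta_j)) = \prod_{j=1}^{D} \mathrm{Ber}(x_{ij}|\mathrm{sigm}(\eta_{ij})) \tag{21.221} $$
$$ \eta_i = \tilde{W} \tilde{z}_i \tag{21.222} $$
$$ \tilde{z}_i \triangleq (z_i; 1) \tag{21.223} $$
$$ \tilde{W} \triangleq (W, \beta) \tag{21.224} $$
$$ p(z_i) = \mathcal{N}(0, I) \tag{21.225} $$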


Derive an EM algorithm to fit this model, using the Jaakkola-Jordan bound. Hint: the answer is in (Tipping 
1998), but the exercise asks you to derive these equations. 


Exercise 21.10 VB for binary FA with probit link 


In Section 11.4.6, we showed how to use EM to fit probit regression, using a model of the form 
$p(y_i = 1|z_i) = \mathbb{I}(z_i > 0)$, where $z_i \sim \mathcal{N}(w^T x_i, 1)$ is latent. Now consider the case where the inputs $x_i$ are 
also unknown, as in binary factor analysis. Show how to fit this model using variational Bayes, making an 
approximation to the posterior of the form $q(x, z, W) = \prod_i q(x_i) q(z_i) \prod_j q(w_j)$. Hint: $q(x_i)$ and 
$q(w_j)$ will be Gaussian, and $q(z_i)$ will be a truncated univariate Gaussian. 


More variational inference 


Introduction 


In Chapter 21, we discussed mean field inference, which approximates the posterior by a product 
of marginal distributions. This allows us to use different parametric forms for each variable, 
which is particularly useful when performing Bayesian inference for the parameters of statistical 
models (such as the mean and variance of a Gaussian or GMM, or the regression weights in a 
GLM), as we saw when we discussed variational Bayes and VB-EM. 

In this chapter, we discuss a slightly different kind of variational inference. The basic idea is 
to minimize J(q) = KL (q||p), where p is the exact but unnormalized posterior as before, but 
where we no longer require q to be factorized. In fact, we do not even require q to be a globally 
valid joint distribution. Instead, we only require that q is locally consistent, meaning that the 
joint distribution of two adjacent nodes agrees with the corresponding marginals (we will define 
this more precisely below). 

In addition to this new kind of inference, we will discuss approximate methods for MAP 
state estimation in discrete graphical models. It turns out that algorithms for solving the MAP 
problem are very similar to some approximate methods for computing marginals, as we will see. 


Loopy belief propagation: algorithmic issues 


There is a very simple approximate inference algorithm for discrete (or Gaussian) graphical 
models known as loopy belief propagation or LBP. The basic idea is extremely simple: we 
apply the belief propagation algorithm of Section 20.2 to the graph, even if it has loops (i.e. 
even if it is not a tree). This method is simple and efficient, and often works well in practice, 
outperforming mean field (Weiss 2001). In this section, we discuss the algorithm in more detail. 
In the next section, we analyse this algorithm in terms of variational inference. 


A brief history 


When applied to loopy graphs, BP is not guaranteed to give correct results, and may not even 
converge. Indeed, Judea Pearl, who invented belief propagation for trees, wrote the following 
about loopy BP in 1988: 


When loops are present, the network is no longer singly connected and local propagation 
schemes will invariably run into trouble ... If we ignore the existence of loops and 
permit the nodes to continue communicating with each other as if the network were 
singly connected, messages may circulate indefinitely around the loops and the process 
may not converge to a stable equilibrium ... Such oscillations do not normally occur in 
probabilistic networks ... which tend to bring all messages to some stable equilibrium as 
time goes on. However, this asymptotic equilibrium is not coherent, in the sense that it 
does not represent the posterior probabilities of all nodes of the network — (Pearl 1988, 
p.195) 


Despite these reservations, Pearl advocated the use of belief propagation in loopy networks as 
an approximation scheme (J. Pearl, personal communication) and exercise 4.7 in (Pearl 1988) 
investigates the quality of the approximation when it is applied to a particular loopy belief 
network. 

However, the main impetus behind the interest in BP arose when McEliece et al. (1998) showed 
that a popular algorithm for error correcting codes known as turbo codes (Berrou et al. 1993) 
could be viewed as an instance of BP applied to a certain kind of graph. This was an important 
observation since turbo codes have gotten very close to the theoretical lower bound on coding 
efficiency proved by Shannon. (Another approach, known as low density parity check or LDPC 
codes, has achieved comparable performance; it also uses LBP for decoding — see Figure 22.1 
for an example.) In (Murphy et al. 1999), LBP was experimentally shown to also work well for 
inference in other kinds of graphical models beyond the error-correcting code context, and since 
then, the method has been widely used in many different applications. 


LBP on pairwise models 


We now discuss how to apply LBP to an undirected graphical model with pairwise factors (we 
discuss the directed case, which can involve higher order factors, in the next section). The 
method is simple: just continually apply Equations 20.11 and 20.10 until convergence. See 
Algorithm 22.1 for the pseudocode, and beliefPropagation for some Matlab code. We will 
discuss issues such as convergence and accuracy of this method shortly. 


Algorithm 22.1: Loopy belief propagation for a pairwise MRF 
Input: node potentials $\psi_s(x_s)$, edge potentials $\psi_{st}(x_s, x_t)$; 
Initialize messages $m_{s \to t}(x_t) = 1$ for all edges $s - t$; 
Initialize beliefs $\mathrm{bel}_s(x_s) = 1$ for all nodes $s$; 
repeat 
    Send message on each edge: 
    $m_{s \to t}(x_t) = \sum_{x_s} \left( \psi_s(x_s) \psi_{st}(x_s, x_t) \prod_{u \in \mathrm{nbr}_s \setminus t} m_{u \to s}(x_s) \right)$; 
    Update belief of each node: $\mathrm{bel}_s(x_s) \propto \psi_s(x_s) \prod_{t \in \mathrm{nbr}_s} m_{t \to s}(x_s)$; 
until beliefs don't change significantly; 
Return marginal beliefs $\mathrm{bel}_s(x_s)$; 
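
The pseudocode above can be turned into a short program. The following is a minimal sketch, assuming synchronous updates, normalized messages, and potentials stored as plain Python lists; the function names `loopy_bp` and `brute_force_marginals` and the data layout are our own choices, and this is not the beliefPropagation Matlab code mentioned in the text.

```python
import itertools
import math

def loopy_bp(node_pot, edge_pot, max_iters=100, tol=1e-8):
    """Sum-product LBP on a pairwise MRF with synchronous updates.

    node_pot: dict s -> list of psi_s(x_s) values
    edge_pot: dict (s, t) -> matrix psi_st[x_s][x_t], one entry per undirected edge
    Returns dict s -> normalized belief.
    """
    nbrs = {s: [] for s in node_pot}
    for s, t in edge_pot:
        nbrs[s].append(t)
        nbrs[t].append(s)

    def psi(s, t, xs, xt):
        # Look up the edge potential regardless of the stored orientation.
        if (s, t) in edge_pot:
            return edge_pot[(s, t)][xs][xt]
        return edge_pot[(t, s)][xt][xs]

    # msgs[(s, t)][x_t] is the message from s to t, initialized to 1.
    msgs = {(s, t): [1.0] * len(node_pot[t]) for s in nbrs for t in nbrs[s]}

    def beliefs():
        out = {}
        for s, pot in node_pot.items():
            b = [pot[x] * math.prod(msgs[(t, s)][x] for t in nbrs[s])
                 for x in range(len(pot))]
            total = sum(b)
            out[s] = [v / total for v in b]
        return out

    bel = beliefs()
    for _ in range(max_iters):
        new = {}
        for (s, t) in msgs:
            m = []
            for xt in range(len(node_pot[t])):
                acc = 0.0
                for xs in range(len(node_pot[s])):
                    incoming = math.prod(msgs[(u, s)][xs] for u in nbrs[s] if u != t)
                    acc += node_pot[s][xs] * psi(s, t, xs, xt) * incoming
                m.append(acc)
            total = sum(m)
            new[(s, t)] = [v / total for v in m]  # normalize for numerical stability
        msgs = new
        new_bel = beliefs()
        change = max(abs(a - b) for s in bel for a, b in zip(bel[s], new_bel[s]))
        bel = new_bel
        if change < tol:
            break
    return bel

def brute_force_marginals(node_pot, edge_pot):
    """Exact marginals by enumeration; only feasible for tiny models."""
    nodes = sorted(node_pot)
    marg = {s: [0.0] * len(node_pot[s]) for s in nodes}
    for x in itertools.product(*[range(len(node_pot[s])) for s in nodes]):
        a = dict(zip(nodes, x))
        w = math.prod(node_pot[s][a[s]] for s in nodes)
        w *= math.prod(pot[a[s]][a[t]] for (s, t), pot in edge_pot.items())
        for s in nodes:
            marg[s][a[s]] += w
    return {s: [v / sum(m) for v in m] for s, m in marg.items()}
```

On a tree (for example, a three-node chain) the beliefs returned by `loopy_bp` should match the brute-force marginals; on a loopy graph they are only approximations, and the loop may stop at `max_iters` without converging.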


Figure 22.1 (a) A simple factor graph representation of a (2,3) low-density parity check code (factor graphs 
are defined in Section 22.2.3.1). Each message bit (hollow round circle) is connected to two parity factors 
(solid black squares), and each parity factor is connected to three bits. Each parity factor has the form 
$\psi_{stu}(x_s, x_t, x_u) = \mathbb{I}(x_s \otimes x_t \otimes x_u = 1)$, where $\otimes$ is the xor operator. The local evidence factors for 
each hidden node are not shown. (b) A larger example of a random LDPC code. We see that this graph is 
“locally tree-like”, meaning there are no short cycles; rather, each cycle has length $\sim \log m$, where m is the 
number of nodes. This gives us a hint as to why loopy BP works so well on such graphs. (Note, however, 
that some error correcting code graphs have short loops, so this is not the full explanation.) Source: 
Figure 2.9 from (Wainwright and Jordan 2008b). Used with kind permission of Martin Wainwright. 


LBP on a factor graph 


To handle models with higher-order clique potentials (which includes directed models where 
some nodes have more than one parent), it is useful to use a representation known as a factor 
graph. We explain this representation below, and then describe how to apply LBP to such 
models. 


Factor graphs 


A factor graph (Kschischang et al. 2001; Frey 2003) is a graphical representation that unifies 
directed and undirected models, and which simplifies certain message passing algorithms. More 
precisely, a factor graph is an undirected bipartite graph with two kinds of nodes. Round nodes 
represent variables, square nodes represent factors, and there is an edge from each variable to 
every factor that mentions it. For example, consider the MRF in Figure 22.2(a). If we assume 
one potential per maximal clique, we get the factor graph in Figure 22.2(b), which represents the 
function 


$$ f(x_1, x_2, x_3, x_4) = f_{124}(x_1, x_2, x_4) f_{234}(x_2, x_3, x_4) \tag{22.1} $$


If we assume one potential per edge, we get the factor graph in Figure 22.2(c), which represents 
the function 


$$ f(x_1, x_2, x_3, x_4) = f_{14}(x_1, x_4) f_{12}(x_1, x_2) f_{34}(x_3, x_4) f_{23}(x_2, x_3) f_{24}(x_2, x_4) \tag{22.2} $$


Figure 22.2 (a) A simple UGM. (b) A factor graph representation assuming one potential per maximal 
clique. (c) A factor graph representation assuming one potential per edge. 




Figure 22.3 (a) A simple DGM. (b) Its corresponding factor graph. Based on Figure 5 of (Yedidia et al. 
2001). 


We can also convert a DGM to a factor graph: just create one factor per CPD, and connect that 
factor to all the variables that use that CPD. For example, Figure 22.3 represents the following 
factorization: 


$$ f(x_1, x_2, x_3, x_4, x_5) = f_1(x_1) f_2(x_2) f_{123}(x_1, x_2, x_3) f_{34}(x_3, x_4) f_{35}(x_3, x_5) \tag{22.3} $$


where we define $f_{123}(x_1, x_2, x_3) = p(x_3|x_1, x_2)$, etc. If each node has at most one parent (and 
hence the graph is a chain or simple tree), then there will be one factor per edge (root nodes 
can have their prior CPDs absorbed into their children’s factors). Such models are equivalent 
to pairwise MRFs. 


Figure 22.4 Message passing on a bipartite factor graph. Square nodes represent factors, and circles 
represent variables. Source: Figure 6 of (Kschischang et al. 2001). Used with kind permission of Brendan 
Frey. 


BP on a factor graph 


We now derive a version of BP that sends messages on a factor graph, as proposed in 
(Kschischang et al. 2001). Specifically, we now have two kinds of messages: variables to factors 


$$ m_{x \to f}(x) = \prod_{h \in \mathrm{nbr}(x) \setminus \{f\}} m_{h \to x}(x) \tag{22.4} $$


and factors to variables: 


$$ m_{f \to x}(x) = \sum_{y} f(x, y) \prod_{y \in \mathrm{nbr}(f) \setminus \{x\}} m_{y \to f}(y) \tag{22.5} $$


Here nbr(x) are all the factors that are connected to variable x, and nbr( f) are all the variables 
that are connected to factor f. These messages are illustrated in Figure 22.4. At convergence, 
we can compute the final beliefs as a product of incoming messages: 


$$ \mathrm{bel}(x) \propto \prod_{f \in \mathrm{nbr}(x)} m_{f \to x}(x) \tag{22.6} $$


In the following sections, we will focus on LBP for pairwise models, rather than for factor 
graphs, but this is just for notational simplicity. 
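
Equations 22.4 to 22.6 can be sketched in code as follows. This is a minimal illustration, assuming discrete variables, factors stored as dictionaries from state tuples to nonnegative values, a fixed number of sweeps rather than a convergence test, and normalized messages; the names `factor_graph_bp` and `brute_force` are hypothetical.

```python
import itertools
import math

def factor_graph_bp(card, factors, sweeps=20):
    """Sum-product BP on a factor graph.

    card: dict variable -> number of states
    factors: list of (scope, table); scope is a tuple of variables and
             table maps a tuple of states (one per scope variable) to a float.
    Returns dict variable -> normalized belief.
    """
    nbr = {v: [i for i, (scope, _) in enumerate(factors) if v in scope] for v in card}
    v2f = {(v, i): [1.0] * card[v] for v in card for i in nbr[v]}
    f2v = {(i, v): [1.0] * card[v] for v in card for i in nbr[v]}
    for _ in range(sweeps):
        # Variable-to-factor messages (Equation 22.4).
        for (v, i) in v2f:
            m = [math.prod(f2v[(h, v)][x] for h in nbr[v] if h != i)
                 for x in range(card[v])]
            total = sum(m)
            v2f[(v, i)] = [a / total for a in m]
        # Factor-to-variable messages (Equation 22.5).
        for (i, v) in f2v:
            scope, table = factors[i]
            m = [0.0] * card[v]
            for states in itertools.product(*[range(card[u]) for u in scope]):
                a = dict(zip(scope, states))
                others = math.prod(v2f[(u, i)][a[u]] for u in scope if u != v)
                m[a[v]] += table[states] * others
            total = sum(m)
            f2v[(i, v)] = [b / total for b in m]
    # Beliefs as a product of incoming factor messages (Equation 22.6).
    bel = {}
    for v in card:
        b = [math.prod(f2v[(i, v)][x] for i in nbr[v]) for x in range(card[v])]
        total = sum(b)
        bel[v] = [c / total for c in b]
    return bel

def brute_force(card, factors):
    """Exact marginals by enumeration, for checking small examples."""
    vs = sorted(card)
    marg = {v: [0.0] * card[v] for v in vs}
    for states in itertools.product(*[range(card[v]) for v in vs]):
        a = dict(zip(vs, states))
        w = math.prod(tbl[tuple(a[u] for u in scope)] for scope, tbl in factors)
        for v in vs:
            marg[v][a[v]] += w
    return {v: [x / sum(m) for x in m] for v, m in marg.items()}
```

When the factor graph is a tree, as for the factorization in Equation 22.3, a handful of sweeps suffices and the beliefs agree with the exact marginals.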


Convergence 


LBP does not always converge, and even when it does, it may converge to the wrong answers. 
This raises several questions: how can we predict when convergence will occur? what can we do 
to increase the probability of convergence? what can we do to increase the rate of convergence? 
We briefly discuss these issues below. We then discuss the issue of accuracy of the results at 
convergence. 
Figure 22.5 Illustration of the behavior of loopy belief propagation on an 11 x 11 Ising grid with 
random potentials, $w_{ij} \sim \mathrm{Unif}(-C, C)$, where C = 11. For larger C, inference becomes harder. (a) 
Percentage of messages that have converged vs time for 3 different update schedules: Dotted = damped 
synchronous (few nodes converge), dashed = undamped asynchronous (half the nodes converge), solid = 
damped asynchronous (all nodes converge). (b-f) Marginal beliefs of certain nodes vs time. Solid straight 
line = truth, dashed = synchronous, solid = damped asynchronous. Source: Figure 11.C.1 of (Koller and 
Friedman 2009). Used with kind permission of Daphne Koller. 


When will LBP converge? 


The details of the analysis of when LBP will converge are beyond the scope of this chapter, but 
we briefly sketch the basic idea. The key analysis tool is the computation tree, which visualizes 
the messages that are passed as the algorithm proceeds. Figure 22.6 gives a simple example. 
In the first iteration, node 1 receives messages from nodes 2 and 3. In the second iteration, it 
receives one message from node 3 (via node 2), one from node 2 (via node 3), and two messages 
from node 4 (via nodes 2 and 3). And so on. 

The key insight is that T iterations of LBP is equivalent to exact computation in a computation 
tree of height T + 1. If the strengths of the connections on the edges are sufficiently weak, then 
the influence of the leaves on the root will diminish over time, and convergence will occur. See 
(Wainwright and Jordan 2008b) and references therein for more information. 


Figure 22.6 (a) A simple loopy graph. (b) The computation tree, rooted at node 1, after 4 rounds of 
message passing. Nodes 2 and 3 occur more often in the tree because they have higher degree than nodes 
1 and 4. Source: Figure 8.2 of (Wainwright and Jordan 2008b). Used with kind permission of Martin 
Wainwright. 


Making LBP converge 


Although the theoretical convergence analysis is very interesting, in practice, when faced with a 
model where LBP is not converging, what should we do? 

One simple way to reduce the chance of oscillation is to use damping. That is, instead of 
sending the message $M^k_{ts}$, we send a damped message of the form 


$$ \tilde{M}^k_{ts}(x_s) = \lambda M_{ts}(x_s) + (1 - \lambda) \tilde{M}^{k-1}_{ts}(x_s) \tag{22.7} $$


where $0 \le \lambda \le 1$ is the damping factor. Clearly if $\lambda = 1$ this reduces to the standard scheme, 
but for $\lambda < 1$, this partial updating scheme can help improve convergence. Using a value such 
as $\lambda \sim 0.5$ is standard practice. The benefits of this approach are shown in Figure 22.5, where 
we see that damped updating results in convergence much more often than undamped updating. 
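
To see why damping can suppress oscillation, consider the following minimal sketch of Equation 22.7. The scalar map used in the demonstration is a toy assumption chosen to oscillate; it is not an actual BP update, and the names `damp` and `iterate` are our own.

```python
def damp(new_msg, prev_msg, lam=0.5):
    """Equation 22.7: lam * new message + (1 - lam) * previous damped message."""
    return [lam * n + (1.0 - lam) * p for n, p in zip(new_msg, prev_msg)]

def iterate(update, x0, lam, steps=60):
    """Repeatedly apply a damped fixed-point update to a one-element 'message'."""
    x = [x0]
    for _ in range(steps):
        x = damp([update(x[0])], x, lam)
    return x[0]

# Toy map x -> 1 - 1.2 x: its fixed point is 1 / 2.2, but the undamped
# iteration overshoots by a factor of 1.2 on every step and so diverges.
toy = lambda x: 1.0 - 1.2 * x
undamped = iterate(toy, 0.0, lam=1.0)  # oscillates with growing amplitude
damped = iterate(toy, 0.0, lam=0.5)    # effective map is 0.5 - 0.1 x
```

Damping leaves the fixed point unchanged; it only shrinks each step, which here turns a divergent oscillation into a contraction.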

It is possible to devise methods, known as double loop algorithms, which are guaranteed to 
converge to a local minimum of the same objective that LBP is minimizing (Yuille 2001; Welling 
and Teh 2001). Unfortunately, these methods are rather slow and complicated, and the accuracy 
of the resulting marginals is usually not much greater than with standard LBP. (Indeed, oscillating 
marginals is sometimes a sign that the LBP approximation itself is a poor one.) Consequently, 
these techniques are not very widely used. In Section 22.4.2, we will see a different convergent 
version of BP that is widely used. 


Increasing the convergence rate: message scheduling 


Even if LBP converges, it may take a long time. The standard approach when implementing 
LBP is to perform synchronous updates, where all nodes absorb messages in parallel, and then 
send out messages in parallel. That is, the new messages at iteration k + 1 are computed in 
parallel using 


$$ m^{k+1} = (f_1(m^k), \ldots, f_E(m^k)) \tag{22.8} $$


where E is the number of edges, and $f_{st}(m)$ is the function that computes the message for 
edge $s \to t$ given all the old messages. This is analogous to the Jacobi method for solving linear 
systems of equations. It is well known (Bertsekas 1997) that the Gauss-Seidel method, which 
performs asynchronous updates in a fixed round-robin fashion, converges faster when solving 
linear systems of equations. We can apply the same idea to LBP, using updates of the form 

$$ m_i^{k+1} = f_i\left( \{m_j^{k+1} : j < i\}, \{m_j^k : j > i\} \right) \tag{22.9} $$
where the message for edge i is computed using new messages (iteration k + 1) from edges 
earlier in the ordering, and using old messages (iteration k) from edges later in the ordering. 
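
The difference between the two schedules can be sketched generically. In the following illustration the update functions come from a small linear system (the Jacobi / Gauss-Seidel setting that motivates the analogy), not from a graphical model; the system, tolerance, and function names are assumptions made for the example.

```python
def sync_sweep(fs, m):
    """Equation 22.8: every update reads only the old vector m."""
    return [f(m) for f in fs]

def async_sweep(fs, m):
    """Equation 22.9: update i sees new values for j < i and old values for j > i."""
    m = list(m)
    for i, f in enumerate(fs):
        m[i] = f(m)
    return m

def run(sweep, fs, m0, tol=1e-10, max_sweeps=1000):
    """Iterate a sweep until the largest change falls below tol."""
    m = list(m0)
    for k in range(1, max_sweeps + 1):
        new = sweep(fs, m)
        if max(abs(a - b) for a, b in zip(new, m)) < tol:
            return new, k
        m = new
    return m, max_sweeps

# Solve A x = b, whose solution is x = (1, 2, 3).
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]

def make_update(i):
    return lambda m: (b[i] - sum(A[i][j] * m[j] for j in range(3) if j != i)) / A[i][i]

fs = [make_update(i) for i in range(3)]
x_sync, k_sync = run(sync_sweep, fs, [0.0, 0.0, 0.0])     # Jacobi
x_async, k_async = run(async_sweep, fs, [0.0, 0.0, 0.0])  # Gauss-Seidel
```

On this system the asynchronous schedule needs fewer sweeps than the synchronous one. For LBP the `fs` would be the per-edge message updates, and no such guarantee holds in general.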

This raises the question of what order to update the messages in. One simple idea is to use 
a fixed or random order. The benefits of this approach are shown in Figure 22.5, where we see 
that (damped) asynchronous updating results in convergence much more often than synchronous 
updating. 

A smarter approach is to pick a set of spanning trees, and then to perform an up-down 
sweep on one tree at a time, keeping all the other messages fixed. This is known as tree 
reparameterization (TRP) (Wainwright et al. 2001), which should not be confused with the more 
sophisticated tree-reweighted BP (often abbreviated to TRW) to be discussed in Section 22.4.2.1. 

However, we can do even better by using an adaptive ordering. The intuition is that we should 
focus our computational efforts on those variables that are most uncertain. (Elidan et al. 2006) 
proposed a technique known as residual belief propagation, in which messages are scheduled 
to be sent according to the norm of the difference from their previous value. That is, we define 
the residual of new message $m_{st}$ at iteration k to be 


$$ r(s, t, k) \triangleq \| \log m_{st} - \log m^k_{st} \|_\infty = \max_i \left| \log \frac{m_{st}(i)}{m^k_{st}(i)} \right| \tag{22.10} $$


We can store messages in a priority queue, and always send the one with highest residual. When 
a message is sent from s to t, all of the other messages that depend on $m_{st}$ (i.e., messages of 
the form $m_{tu}$ where $u \in \mathrm{nbr}(t) \setminus s$) need to be recomputed; their residual is recomputed, and 
they are added back to the queue. In (Elidan et al. 2006), they showed (experimentally) that this 
method converges more often, and much faster, than using synchronous updating, asynchronous 
updating with a fixed order, and the TRP approach. 
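
The residual of Equation 22.10 and the queue selection step can be sketched as follows. This is only the scheduling core, assuming strictly positive messages stored as lists; recomputing the dependent messages after each send is omitted, and the names `residual` and `pick_next` are hypothetical.

```python
import heapq
import math

def residual(new_msg, old_msg):
    """Equation 22.10: max_i |log(new(i) / old(i))| for strictly positive messages."""
    return max(abs(math.log(n / o)) for n, o in zip(new_msg, old_msg))

def pick_next(candidates, current):
    """Return the edge whose candidate message has the largest residual.

    candidates, current: dict edge -> message (list of positive floats).
    heapq is a min-heap, so residuals are stored negated.
    """
    heap = [(-residual(candidates[e], current[e]), e) for e in candidates]
    heapq.heapify(heap)
    neg_r, edge = heap[0]
    return edge, -neg_r
```

In a full implementation the heap would persist across iterations, and only the entries for messages leaving node t would be refreshed after a message from s to t is sent.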

A refinement of residual BP was presented in (Sutton and McCallum 2007). In this paper, they 
use an upper bound on the residual of a message instead of the actual residual. This means 
that messages are only computed if they are going to be sent; they are not just computed for 
the purposes of evaluating the residual. This was observed to be about five times faster than 
residual BP, although the quality of the final results is similar. 


Accuracy of LBP 


For a graph with a single loop, one can show that the max-product version of LBP will find the 
correct MAP estimate, if it converges (Weiss 2000). For more general graphs, one can bound 
the error in the approximate marginals computed by LBP, as shown in (Wainwright et al. 2003; 
Vinyals et al. 2010). Much stronger results are available in the case of Gaussian models (Weiss 
and Freeman 2001a; Johnson et al. 2006; Bickson 2009). In particular, in the Gaussian case, if 
the method converges, the means are exact, although the variances are not (typically the beliefs 
are overconfident). 
Other speedup tricks for LBP * 


There are several tricks one can use to make BP run faster. We discuss some of them below. 


Fast message computation for large state spaces 


The cost of computing each message in BP (whether in a tree or a loopy graph) is $O(K^f)$, 
where K is the number of states, and f is the size of the largest factor (f = 2 for pairwise 
UGMs). In many vision problems (e.g., image denoising), K is quite large (say 256), because 
it represents the discretization of some underlying continuous space, so $O(K^2)$ per message 
is too expensive. Fortunately, for certain kinds of pairwise potential functions of the form 
$\psi_{st}(x_s, x_t) = \psi(x_s - x_t)$, one can compute the sum-product messages in O(K log K) time 
using the fast Fourier transform or FFT, as explained in (Felzenszwalb and Huttenlocher 2006). 
The key insight is that message computation is just convolution: 


$$ M^k_{st}(x_t) = \sum_{x_s} \psi(x_s - x_t) h(x_s) \tag{22.11} $$


where $h(x_s) = \psi_s(x_s) \prod_{v \in \mathrm{nbr}(s) \setminus t} M^{k-1}_{vs}(x_s)$. If the potential function $\psi(z)$ is a Gaussian-like 
potential, we can compute the convolution in O(K) time by sequentially convolving with a 
small number of box filters (Felzenszwalb and Huttenlocher 2006). 
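
The convolution view of Equation 22.11 can be checked with a short program. The following minimal sketch compares the direct $O(K^2)$ sum with an FFT-based evaluation; it uses a simple recursive radix-2 FFT written for illustration (an assumption made to keep the example self-contained), and all function names are our own.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two (unnormalized)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def convolve_fft(a, b):
    """Linear convolution of two real sequences via zero-padded FFTs."""
    size = 1
    while size < len(a) + len(b) - 1:
        size *= 2
    fa = fft([complex(x) for x in a] + [0j] * (size - len(a)))
    fb = fft([complex(x) for x in b] + [0j] * (size - len(b)))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    return [v.real / size for v in prod[: len(a) + len(b) - 1]]

def message_direct(psi, h):
    """O(K^2): M(x_t) = sum over x_s of psi(x_s - x_t) h(x_s)."""
    K = len(h)
    return [sum(psi(s - t) * h[s] for s in range(K)) for t in range(K)]

def message_fft(psi, h):
    """Same message via convolution with the reversed kernel."""
    K = len(h)
    # g[j] = psi(K - 1 - j), so conv(h, g)[t + K - 1] = sum_s psi(s - t) h(s).
    g = [psi(K - 1 - j) for j in range(2 * K - 1)]
    full = convolve_fft(h, g)
    return full[K - 1 : 2 * K - 1]
```

The kernel is reversed because Equation 22.11 is a correlation in $x_s - x_t$; for a symmetric potential such as a Gaussian the reversal makes no difference.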

For the max-product case, a technique called the distance transform can be used to compute 
messages in O(K) time. However, this only works if $\psi(z) = \exp(-E(z))$ and where E(z) 
has one of the following forms: quadratic, $E(z) = z^2$; truncated linear, $E(z) = \min(c_1|z|, c_2)$; or 
Potts model, $E(z) = c\,\mathbb{I}(z \ne 0)$. See (Felzenszwalb and Huttenlocher 2006) for details. 
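
For the truncated linear and Potts costs, the O(K) computation reduces to two linear passes plus a clamp. The sketch below works in the negative log (min-sum) domain, so h holds costs rather than potentials; it assumes $c_1, c_2, c \ge 0$, and the function names are our own. The quadratic case requires a lower-envelope computation and is not shown.

```python
def message_truncated_linear(h, c1, c2):
    """m(x_t) = min over x_s of [min(c1 * |x_s - x_t|, c2) + h(x_s)], in O(K)."""
    m = list(h)
    for i in range(1, len(m)):            # forward pass: best source to the left
        m[i] = min(m[i], m[i - 1] + c1)
    for i in range(len(m) - 2, -1, -1):   # backward pass: best source to the right
        m[i] = min(m[i], m[i + 1] + c1)
    cap = min(h) + c2                     # truncation: any source at cost c2
    return [min(v, cap) for v in m]

def message_potts(h, c):
    """m(x_t) = min(h(x_t), min over x_s of h(x_s) + c), in O(K)."""
    cap = min(h) + c
    return [min(v, cap) for v in h]

def message_brute_force(h, cost):
    """O(K^2) reference: m(x_t) = min over x_s of [cost(x_s - x_t) + h(x_s)]."""
    K = len(h)
    return [min(cost(s - t) + h[s] for s in range(K)) for t in range(K)]
```

The max-product message is recovered as $\exp(-m(x_t))$.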


Multi-scale methods 


A method which is specific to 2d lattice structures, which commonly arise in computer vision, 
is based on multi-grid techniques. Such methods are widely used in numerical linear algebra, 
where one of the core problems is the fast solution of linear systems of equations; this is 
equivalent to MAP estimation in a Gaussian MRF. In the computer vision context, (Felzenszwalb 
and Huttenlocher 2006) suggested using the following heuristic to significantly speed up BP: 
construct a coarse-to-fine grid, compute messages at the coarse level, and use this to initialize 
messages at the level below; when we reach the bottom level, just a few iterations of standard BP 
are required, since long-range communication has already been achieved via the initialization 
process. 

The beliefs at the coarse level are computed over a small number of large blocks. The local 
evidence is computed from the average log-probability each possible block label assigns to all 
the pixels in the block. The pairwise potential is based on the discrepancy between labels of 
neighboring blocks, taking into account their size. We can then run LBP at the coarse level, 
and then use this to initialize the messages one level down. Note that the model is still a 
flat grid; however, the initialization process exploits the multi-scale nature of the problem. See 
(Felzenszwalb and Huttenlocher 2006) for details. 
Cascades 


Another trick for handling high-dimensional state-spaces, that can also be used with exact 
inference (e.g., for chain-structured CRFs), is to prune out improbable states based on a 
computationally cheap filtering step. In fact, one can create a hierarchy of models which trade off 
speed and accuracy. This is called a computational cascade. In the case of chains, one can 
guarantee that the cascade will never filter out the true MAP solution (Weiss et al. 2010). 


Loopy belief propagation: theoretical issues * 


We now attempt to understand the LBP algorithm from a variational point of view. Our 
presentation is closely based on an excellent 300-page review article (Wainwright and Jordan 2008a). 
This paper is sometimes called “the monster” (by its own authors!) in view of its length and 
technical difficulty. This section just sketches some of the main results. 

To simplify the presentation, we focus on the special case of pairwise UGMs with discrete 
variables and tabular potentials. Many of the results generalize to UGMs with higher-order clique 
potentials (which includes DGMs), but this makes the notation more complex (see (Koller and 
Friedman 2009) for details of the general case). 


UGMs represented in exponential family form 


We assume the distribution has the following form: 


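$$ p(x|\theta, G) = \frac{1}{Z(\theta)} \exp\left\{ \sum_{s \in \mathcal{V}} \theta_s(x_s) + \sum_{(s,t) \in \mathcal{E}} \theta_{st}(x_s, x_t) \right\} \tag{22.12} $$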


where graph G has nodes $\mathcal{V}$ and edges $\mathcal{E}$. (Henceforth we will drop the explicit conditioning 
on $\theta$ and G for brevity, since we assume both are known and fixed.) We can rewrite this in 
exponential family form as follows: 


$$ p(x|\theta) = \frac{1}{Z(\theta)} \exp(-E(x)) \tag{22.13} $$
$$ E(x) \triangleq -\theta^T \phi(x) \tag{22.14} $$


where $\theta = (\{\theta_{s;j}\}, \{\theta_{s,t;j,k}\})$ are all the node and edge parameters (the canonical parameters), 
and $\phi(x) = (\{\mathbb{I}(x_s = j)\}, \{\mathbb{I}(x_s = j, x_t = k)\})$ are all the node and edge indicator functions 
(the sufficient statistics). Note: we use $s, t \in \mathcal{V}$ to index nodes and $j, k \in \mathcal{X}$ to index states. 

The means of the sufficient statistics are known as the mean parameters of the model, and are 
given by 


$$ \mu \triangleq \mathbb{E}[\phi(x)] = (\{p(x_s = j)\}_s, \{p(x_s = j, x_t = k)\}_{s \ne t}) = (\{\mu_{s;j}\}_s, \{\mu_{st;jk}\}_{s \ne t}) \tag{22.15} $$


This is a vector of length $d = |\mathcal{X}||V| + |\mathcal{X}|^2|E|$, containing the node and edge marginals. 
It completely characterizes the distribution $p(x|\theta)$, so we sometimes treat $\mu$ as a distribution 
itself. 

Equation 22.12 is called the standard overcomplete representation. It is called “overcom- 
plete” because it ignores the sum-to-one constraints. In some cases, it is convenient to remove 
this redundancy. For example, consider an Ising model where $X_s \in \{0, 1\}$. The model can be 
written as 


$$ p(x) = \frac{1}{Z(\theta)} \exp\left\{ \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \right\} \tag{22.16} $$


Hence we can use the following minimal parameterization 
$$ \phi(x) = (x_s, s \in V; \; x_s x_t, (s,t) \in E) \in \mathbb{R}^d \tag{22.17} $$
where d = |V| + |E|. The corresponding mean parameters are $\mu_s = p(x_s = 1)$ and 
$\mu_{st} = p(x_s = 1, x_t = 1)$. 

The marginal polytope 


The space of allowable $\mu$ vectors is called the marginal polytope, and is denoted M(G), where 
G is the structure of the graph defining the UGM. This is defined to be the set of all mean 
parameters for the given model that can be generated from a valid probability distribution: 


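$$ \mathbb{M}(G) \triangleq \left\{ \mu \in \mathbb{R}^d : \exists p \text{ s.t. } \mu = \sum_x \phi(x) p(x) \text{ for some } p(x) \ge 0, \sum_x p(x) = 1 \right\} \tag{22.18} $$


For example, consider an Ising model. If we have just two nodes connected as $X_1 - X_2$, 
one can show that we have the following minimal set of constraints: $0 \le \mu_{12}$, $0 \le \mu_{12} \le \mu_1$, 
$0 \le \mu_{12} \le \mu_2$, and $1 + \mu_{12} - \mu_1 - \mu_2 \ge 0$. We can write these in matrix-vector form as 


$$ \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & -1 \\ 0 & 1 & -1 \\ -1 & -1 & 1 \end{pmatrix} \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_{12} \end{pmatrix} \ge \begin{pmatrix} 0 \\ 0 \\ 0 \\ -1 \end{pmatrix} \tag{22.19} $$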


These four constraints define a series of half-planes, whose intersection defines a polytope, 
as shown in Figure 22.7(a). 

Since M(G) is obtained by taking a convex combination of the $\phi(x)$ vectors, it can also be 
written as the convex hull of the feature set: 


$$ \mathbb{M}(G) = \mathrm{conv}\{\phi_1(x), \ldots, \phi_d(x)\} \tag{22.20} $$
For example, for a 2 node MRF $X_1 - X_2$ with binary states, we have 
M(G) = conv{ (0, 0,0), (1,0, 0), (0, 1,0), (1, 1, 1)} (22.21) 


These are the four black dots in Figure 22.7(a). We see that the convex hull defines the same 
volume as the intersection of half-spaces. 
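
The equivalence between the two descriptions can be checked numerically for the two-node case. The following minimal sketch computes the mean parameters $(\mu_1, \mu_2, \mu_{12})$ of a distribution over $\{0,1\}^2$ and tests the constraints of Equation 22.19; the helper names and the random test distributions are assumptions made for illustration.

```python
import random

# Rows encode mu12 >= 0, mu1 - mu12 >= 0, mu2 - mu12 >= 0, 1 + mu12 - mu1 - mu2 >= 0.
A = [[0, 0, 1], [1, 0, -1], [0, 1, -1], [-1, -1, 1]]
b = [0, 0, 0, -1]

def mean_params(p):
    """p maps (x1, x2) in {0,1}^2 to a probability; returns (mu1, mu2, mu12)."""
    mu1 = p[(1, 0)] + p[(1, 1)]
    mu2 = p[(0, 1)] + p[(1, 1)]
    return (mu1, mu2, p[(1, 1)])

def in_marginal_polytope(mu, eps=1e-12):
    """Check A mu >= b, up to a small numerical tolerance."""
    return all(sum(a * m for a, m in zip(row, mu)) >= bi - eps
               for row, bi in zip(A, b))

def random_distribution(rng):
    """A random distribution over the four joint states."""
    w = [rng.random() for _ in range(4)]
    total = sum(w)
    states = [(0, 0), (1, 0), (0, 1), (1, 1)]
    return {x: wi / total for x, wi in zip(states, w)}
```

Point masses on the four joint states map to the vertices in Equation 22.21, and any mixture of them satisfies all four inequalities. A vector such as (0.5, 0.5, 0.6) violates $\mu_{12} \le \mu_1$ and so cannot arise from any distribution.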

The marginal polytope will play a crucial role in the approximate inference algorithms we 
discuss in the rest of this chapter. 
Figure 22.7 (a) Illustration of the marginal polytope for an Ising model with two variables. (b) Cartoon 
illustration of the set $\mathbb{M}_F(G)$, which is a nonconvex inner bound on the marginal polytope M(G). $\mathbb{M}_F(G)$ 
is used by mean field. (c) Cartoon illustration of the relationship between M(G) and L(G), which is used 
by loopy BP. The set L(G) is always an outer bound on M(G), and the inclusion $\mathbb{M}(G) \subset \mathbb{L}(G)$ is strict 
whenever G has loops. Both sets are polytopes, which can be defined as an intersection of half-planes 
(defined by facets), or as the convex hull of the vertices. L(G) actually has fewer facets than M(G), despite 
the picture. In fact, L(G) has $O(|\mathcal{X}||V| + |\mathcal{X}|^2|E|)$ facets, where $|\mathcal{X}|$ is the number of states per variable, 
|V| is the number of variables, and |E| is the number of edges. By contrast, M(G) has $O(|\mathcal{X}|^{|V|})$ facets. 
On the other hand, L(G) has more vertices than M(G), despite the picture, since L(G) contains all the 
binary vector extreme points $\mu \in \mathbb{M}(G)$, plus additional fractional extreme points. Source: Figures 3.6, 
5.4 and 4.2 of (Wainwright and Jordan 2008a). Used with kind permission of Martin Wainwright. 


Exact inference as a variational optimization problem 


Recall from Section 21.2 that the goal of variational inference is to find the distribution q that 
maximizes the energy functional 


$$ L(q) \triangleq -\mathrm{KL}(q||p) + \log Z = \mathbb{E}_q[\log \tilde{p}(x)] + \mathbb{H}(q) \le \log Z \tag{22.22} $$


where $\tilde{p}(x) = Z p(x)$ is the unnormalized posterior. If we write $\log \tilde{p}(x) = \theta^T \phi(x)$, and we 
let q = p, then the exact energy functional becomes 


$$ \max_{\mu \in \mathbb{M}(G)} \theta^T \mu + \mathbb{H}(\mu) \tag{22.23} $$


where $\mu = \mathbb{E}_p[\phi(x)]$ is a joint distribution over all state configurations x (so it is valid to write 
$\mathbb{H}(\mu)$). Since the KL divergence is zero when p = q, we know that 
$$ \max_{\mu \in \mathbb{M}(G)} \theta^T \mu + \mathbb{H}(\mu) = \log Z(\theta) \tag{22.24} $$
This is a way to cast exact inference as a variational optimization problem. 
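
Equation 22.24 can be verified by enumeration on a small model. The following minimal sketch uses the minimal Ising parameterization of Equation 22.16 on a three-node loop; the parameter values and the name `ising_exact` are assumptions chosen for illustration, and the enumeration is only feasible for a handful of nodes.

```python
import itertools
import math

def ising_exact(theta_node, theta_edge):
    """Enumerate a binary Ising model with x_s in {0, 1}.

    theta_node: list of node parameters; theta_edge: dict (s, t) -> parameter.
    Returns (log Z, node mean parameters, edge mean parameters, joint entropy).
    """
    n = len(theta_node)
    states = list(itertools.product([0, 1], repeat=n))
    score = [sum(theta_node[s] * x[s] for s in range(n))
             + sum(w * x[s] * x[t] for (s, t), w in theta_edge.items())
             for x in states]
    top = max(score)
    log_z = top + math.log(sum(math.exp(v - top) for v in score))
    p = [math.exp(v - log_z) for v in score]
    mu_node = [sum(pi * x[s] for pi, x in zip(p, states)) for s in range(n)]
    mu_edge = {e: sum(pi * x[e[0]] * x[e[1]] for pi, x in zip(p, states))
               for e in theta_edge}
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0.0)
    return log_z, mu_node, mu_edge, entropy

theta_node = [0.5, -0.2, 0.1]
theta_edge = {(0, 1): 1.0, (1, 2): -0.7, (0, 2): 0.3}
log_z, mu_node, mu_edge, entropy = ising_exact(theta_node, theta_edge)

# theta^T mu + H(mu), evaluated at the exact mean parameters.
value = (sum(t * m for t, m in zip(theta_node, mu_node))
         + sum(theta_edge[e] * mu_edge[e] for e in theta_edge)
         + entropy)
```

At the exact mean parameters, `value` equals `log_z` up to rounding, which is the statement that the bound in Equation 22.22 is tight when q = p.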

Equation 22.24 seems easy to optimize: the objective is concave, since it is the sum of a linear 
function and a concave function (see Figure 2.21 to see why entropy is concave); furthermore, we 
are maximizing this over a convex set. However, the marginal polytope M(G) has exponentially 
many facets. In some cases, there is structure to this polytope that can be exploited by dynamic 
programming (as we saw in Chapter 20), but in general, exact inference takes exponential time. 
Most of the existing deterministic approximate inference schemes that have been proposed in 
the literature can be seen as different approximations to the marginal polytope, as we explain 
below. 
Mean field as a variational optimization problem 


We discussed mean field at length in Chapter 21. Let us re-interpret mean field inference in 
our new more abstract framework. This will help us compare it to other approximate methods 
which we discuss below. 

First, let F be an edge subgraph of the original graph G, and let $\mathcal{I}(F) \subseteq \mathcal{I}$ be the subset of 
sufficient statistics associated with the cliques of F. Let $\Omega$ be the set of canonical parameters 
for the full model, and define the canonical parameter space for the submodel as follows: 


$$ \Omega(F) \triangleq \{\theta \in \Omega : \theta_\alpha = 0 \;\; \forall \alpha \in \mathcal{I} \setminus \mathcal{I}(F)\} \tag{22.25} $$


In other words, we require that the natural parameters associated with the sufficient statistics 
a outside of our chosen class to be zero. For example, in the case of a fully factorized 
approximation, Fo, we remove all edges from the graph, giving 


Q(Fo) £ {0 E Q : Os =0V(s,t) € E} (22.26) 


In the case of structured mean field (Section 21.4), we set $\theta_{st} = 0$ for edges which are not in
our tractable subgraph.
Next, we define the mean parameter space of the restricted model as follows:


$$M_F(G) \triangleq \{ \mu \in \mathbb{R}^d : \mu = E_\theta[\phi(x)] \text{ for some } \theta \in \Omega(F) \} \tag{22.27}$$


This is called an inner approximation to the marginal polytope, since $M_F(G) \subseteq M(G)$. See
Figure 22.7(b) for a sketch. Note that $M_F(G)$ is a non-convex polytope, which results in multiple
local optima. By contrast, some of the approximations we will consider later will be convex.
We define the entropy of our approximation $H(\mu(F))$ as the entropy of the distribution
$\mu$ defined on submodel $F$. Then we define the mean field energy functional optimization
problem as follows:

$$\max_{\mu \in M_F(G)} \theta^T \mu + H(\mu) \le \log Z(\theta) \tag{22.28}$$
In the case of the fully factorized mean field approximation for pairwise UGMs, we can write 
this objective as follows: 


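$$\max_{\mu \in \mathcal{P}^d} \sum_{s \in V} \sum_{x_s} \theta_s(x_s) \mu_s(x_s) + \sum_{(s,t) \in E} \sum_{x_s, x_t} \theta_{st}(x_s, x_t) \mu_s(x_s) \mu_t(x_t) + \sum_{s \in V} H(\mu_s) \tag{22.29}$$


where $\mu_s \in \mathcal{P}$, and $\mathcal{P}$ is the probability simplex over $\mathcal{X}$.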

Mean field involves a concave objective being maximized over a non-convex set. It is typically
optimized using coordinate ascent, since it is easy to optimize a scalar concave function over $\mathcal{P}$
for each $\mu_s$. For example, for a pairwise UGM we get


$$\mu_s(x_s) \propto \exp(\theta_s(x_s)) \exp\left( \sum_{t \in \mathrm{nbr}(s)} \sum_{x_t} \mu_t(x_t) \theta_{st}(x_s, x_t) \right) \tag{22.30}$$
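
The coordinate ascent update can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names (`mean_field`, `mf_objective`, `log_partition`), the dictionary layout for the edge parameters, the uniform initialisation and the small three-node example are all assumptions made here, not part of the text. The brute-force `log_partition` is included only so that the lower bound property of the mean field objective can be checked on a toy model.

```python
import itertools
import numpy as np

def mean_field(theta_s, theta_st, edges, n_iters=100):
    """Coordinate ascent on the naive mean field update for a pairwise UGM.

    theta_s: (n, K) array of node parameters.
    theta_st: dict mapping edge (s, t) to a (K, K) array indexed [x_s, x_t].
    """
    n, K = theta_s.shape
    mu = np.full((n, K), 1.0 / K)  # uniform initialisation (an arbitrary choice)
    nbrs = {s: [] for s in range(n)}
    for (s, t) in edges:
        nbrs[s].append((t, theta_st[(s, t)]))    # rows index x_s
        nbrs[t].append((s, theta_st[(s, t)].T))  # transpose so rows index x_t
    for _ in range(n_iters):
        for s in range(n):
            # theta_s(x_s) + sum_t sum_{x_t} mu_t(x_t) theta_st(x_s, x_t)
            logit = theta_s[s] + sum(th @ mu[t] for t, th in nbrs[s])
            logit = logit - logit.max()          # numerical stability
            mu[s] = np.exp(logit) / np.exp(logit).sum()
    return mu

def mf_objective(mu, theta_s, theta_st, edges):
    """Energy term plus sum of node entropies for a fully factorized mu."""
    val = np.sum(mu * theta_s)
    for (s, t) in edges:
        val += mu[s] @ theta_st[(s, t)] @ mu[t]
    val -= np.sum(mu * np.log(np.clip(mu, 1e-300, None)))
    return val

def log_partition(theta_s, theta_st, edges):
    """Brute-force log Z; only feasible for tiny models."""
    n, K = theta_s.shape
    scores = []
    for x in itertools.product(range(K), repeat=n):
        sc = sum(theta_s[s, x[s]] for s in range(n))
        sc += sum(theta_st[(s, t)][x[s], x[t]] for (s, t) in edges)
        scores.append(sc)
    scores = np.array(scores)
    return scores.max() + np.log(np.exp(scores - scores.max()).sum())

# Hypothetical toy model: a 3-cycle of binary nodes with attractive couplings.
edges = [(0, 1), (1, 2), (0, 2)]
theta_s = np.array([[0.0, 0.5], [0.0, -0.3], [0.0, 0.2]])
theta_st = {e: 0.8 * np.eye(2) for e in edges}
mu = mean_field(theta_s, theta_st, edges)
bound = mf_objective(mu, theta_s, theta_st, edges)
log_z = log_partition(theta_s, theta_st, edges)
print(bound, log_z)
```

Because every fully factorized $\mu$ lies in the inner approximation, `bound` should never exceed `log_z`, whichever local optimum the iteration reaches.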


LBP as a variational optimization problem 


In this section, we explain how LBP can be viewed as a variational inference problem. 
Figure 22.8 (a) Illustration of pairwise UGM on binary nodes, together with a set of pseudo marginals
that are not globally consistent. (b) A slice of the marginal polytope illustrating the set of feasible edge
marginals, assuming the node marginals are clamped at $\mu_1 = \mu_2 = \mu_3 = 0.5$. Source: Figure 4.1 of
(Wainwright and Jordan 2008a). Used with kind permission of Martin Wainwright.


An outer approximation to the marginal polytope 


If we want to consider all possible probability distributions which are Markov wrt our model, we
need to consider all vectors $\mu \in M(G)$. Since the set $M(G)$ is exponentially large, it is usually
infeasible to optimize over. A standard strategy in combinatorial optimization is to relax the
constraints. In this case, instead of requiring probability vector $\mu$ to live in $M(G)$, we consider
a vector $\tau$ that only satisfies the following local consistency constraints:


$$\sum_{x_s} \tau_s(x_s) = 1 \tag{22.31}$$

$$\sum_{x_t} \tau_{st}(x_s, x_t) = \tau_s(x_s) \tag{22.32}$$


The first constraint is called the normalization constraint, and the second is called the marginalization
constraint. We then define the set


$$L(G) \triangleq \{ \tau \ge 0 : \text{(22.31) holds } \forall s \in V \text{ and (22.32) holds } \forall (s,t) \in E \} \tag{22.33}$$


The set $L(G)$ is also a polytope, but it only has $O(|V| + |E|)$ constraints. It is a convex outer
approximation to $M(G)$, as shown in Figure 22.7(c).

We call the terms $\tau_s, \tau_{st} \in L(G)$ pseudo marginals, since they may not correspond to
marginals of any valid probability distribution. As an example of this, consider Figure 22.8(a).
The picture shows a set of pseudo node and edge marginals, which satisfy the local consistency
requirements. However, they are not globally consistent. To see why, note that $\tau_{12}$ implies
$p(X_1 = X_2) = 0.8$, $\tau_{23}$ implies $p(X_2 = X_3) = 0.8$, but $\tau_{13}$ implies $p(X_1 = X_3) = 0.2$, which
is not possible (see (Wainwright and Jordan 2008b, p81) for a formal proof). Indeed, Figure 22.8(b)
shows that $L(G)$ contains points that are not in $M(G)$.
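
One informal way to see the inconsistency, sketched here as a union-bound argument rather than the formal proof cited above: if $X_1 \neq X_3$, then at least one of $X_1 \neq X_2$ or $X_2 \neq X_3$ must hold, so any valid joint distribution satisfies

$$p(X_1 \neq X_3) \le p(X_1 \neq X_2) + p(X_2 \neq X_3) = 0.2 + 0.2 = 0.4$$

whereas the pseudo marginal on the edge between nodes 1 and 3 requires $p(X_1 \neq X_3) = 1 - 0.2 = 0.8$.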

We claim that $M(G) \subseteq L(G)$, with equality iff $G$ is a tree. To see this, first consider
an element $\mu \in M(G)$. Any such vector must satisfy the normalization and marginalization
constraints, hence $M(G) \subseteq L(G)$.

Now consider the converse. Suppose $T$ is a tree, and let $\mu \in L(T)$. By definition, this satisfies
the normalization and marginalization constraints. However, any tree can be represented in the
form


$$p_\mu(x) = \prod_{s \in V} \mu_s(x_s) \prod_{(s,t) \in E} \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s) \mu_t(x_t)} \tag{22.34}$$


Hence satisfying normalization and local consistency is enough to define a valid distribution for
any tree. Hence $\mu \in M(T)$ as well.

In contrast, if the graph has loops, we have that $M(G) \neq L(G)$. See Figure 22.8(b) for an
example of this fact.


The entropy approximation 


From Equation 22.34, we can write the exact entropy of any tree structured distribution $\mu \in M(T)$ as follows:


$$H(\mu) = \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} I_{st}(\mu_{st}) \tag{22.35}$$

$$H_s(\mu_s) = -\sum_{x_s \in \mathcal{X}_s} \mu_s(x_s) \log \mu_s(x_s) \tag{22.36}$$

$$I_{st}(\mu_{st}) = \sum_{(x_s, x_t) \in \mathcal{X}_s \times \mathcal{X}_t} \mu_{st}(x_s, x_t) \log \frac{\mu_{st}(x_s, x_t)}{\mu_s(x_s) \mu_t(x_t)} \tag{22.37}$$


Note that we can rewrite the mutual information term in the form $I_{st}(\mu_{st}) = H_s(\mu_s) + H_t(\mu_t) - H_{st}(\mu_{st})$, and hence we get the following alternative but equivalent expression:


$$H(\mu) = -\sum_{s \in V} (d_s - 1) H_s(\mu_s) + \sum_{(s,t) \in E} H_{st}(\mu_{st}) \tag{22.38}$$


where $d_s$ is the degree (number of neighbors) for node $s$.
The Bethe¹ approximation to the entropy is simply the use of Equation 22.35 even when we
don't have a tree:


$$H_{\mathrm{Bethe}}(\tau) = \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E} I_{st}(\tau_{st}) \tag{22.39}$$


We define the Bethe free energy as

$$F_{\mathrm{Bethe}}(\tau) \triangleq -\left[ \theta^T \tau + H_{\mathrm{Bethe}}(\tau) \right] \tag{22.40}$$

We define the Bethe energy functional as the negative of the Bethe free energy.
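
To make the claim that the Bethe entropy is exact on trees concrete, the following sketch computes the node-entropy-minus-mutual-information formula from exact marginals of a small chain and compares it with the entropy of the joint. All names here (`entropy`, `bethe_entropy`, `joint_from_potentials`, `marginals_from_joint`) and the random three-node chain are hypothetical choices for illustration.

```python
import itertools
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of an array of probabilities."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def bethe_entropy(node_marg, edge_marg):
    """sum_s H_s - sum_{(s,t)} I_st, using I_st = H_s + H_t - H_st."""
    h = sum(entropy(m) for m in node_marg.values())
    for (s, t), m in edge_marg.items():
        h -= entropy(node_marg[s]) + entropy(node_marg[t]) - entropy(m)
    return h

def joint_from_potentials(theta_s, theta_st, K):
    """Brute-force joint table of a pairwise model with K states per node."""
    n = len(theta_s)
    logp = np.zeros((K,) * n)
    for x in itertools.product(range(K), repeat=n):
        logp[x] = sum(theta_s[s][x[s]] for s in range(n))
        logp[x] += sum(th[x[s], x[t]] for (s, t), th in theta_st.items())
    p = np.exp(logp - logp.max())
    return p / p.sum()

def marginals_from_joint(joint, edges):
    """Exact node and edge marginals; edges (s, t) must have s < t."""
    n = joint.ndim
    node = {s: joint.sum(axis=tuple(a for a in range(n) if a != s))
            for s in range(n)}
    edge = {(s, t): joint.sum(axis=tuple(a for a in range(n) if a not in (s, t)))
            for (s, t) in edges}
    return node, edge

# Hypothetical example: a chain 0 - 1 - 2 with random parameters.
rng = np.random.default_rng(0)
K = 3
chain = {(0, 1): rng.normal(size=(K, K)), (1, 2): rng.normal(size=(K, K))}
node_params = rng.normal(size=(3, K))
joint = joint_from_potentials(node_params, chain, K)
node_marg, edge_marg = marginals_from_joint(joint, list(chain))
print(bethe_entropy(node_marg, edge_marg), entropy(joint))
```

On the chain the two printed values should agree up to floating point error; adding the edge (0, 2) to `chain` would create a loop, and the two quantities would then generally differ.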


1. Hans Bethe was a German-American physicist, 1906-2005. 
22.3.5.3 The LBP objective


Combining the outer approximation L(G) with the Bethe approximation to the entropy, we get 
the following Bethe variational problem (BVP): 
$$\min_{\tau \in L(G)} F_{\mathrm{Bethe}}(\tau) = \max_{\tau \in L(G)} \theta^T \tau + H_{\mathrm{Bethe}}(\tau) \tag{22.41}$$

The space we are optimizing over is a convex set, but the objective itself is not concave (since
$H_{\mathrm{Bethe}}$ is not concave). Thus there can be multiple local optima of the BVP.

The value obtained by the BVP is an approximation to $\log Z(\theta)$. In the case of trees, the
approximation is exact, and in the case of models with attractive potentials, the approximation
turns out to be a lower bound (Sudderth et al. 2008).


22.3.5.4 Message passing and Lagrange multipliers 


In this subsection, we will show that any fixed point of the LBP algorithm defines a stationary
point of the above constrained objective. Let us define the normalization constraint as $C_{ss}(\tau) \triangleq 1 - \sum_{x_s} \tau_s(x_s)$, and the marginalization constraint as $C_{ts}(x_s; \tau) \triangleq \tau_s(x_s) - \sum_{x_t} \tau_{st}(x_s, x_t)$
for each edge $t \to s$. We can now write the Lagrangian as


$$\mathcal{L}(\tau, \lambda; \theta) \triangleq \theta^T \tau + H_{\mathrm{Bethe}}(\tau) + \sum_s \lambda_{ss} C_{ss}(\tau) + \sum_{s,t} \left[ \sum_{x_s} \lambda_{ts}(x_s) C_{ts}(x_s; \tau) + \sum_{x_t} \lambda_{st}(x_t) C_{st}(x_t; \tau) \right] \tag{22.42}$$


(The constraint that $\tau \ge 0$ is not explicitly enforced, but one can show that it will hold at the
optimum since $\theta > 0$.) Some simple algebra then shows that $\nabla_\tau \mathcal{L} = 0$ yields


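$$\log \tau_s(x_s) = \lambda_{ss} + \theta_s(x_s) + \sum_{t \in \mathrm{nbr}(s)} \lambda_{ts}(x_s) \tag{22.43}$$

$$\log \frac{\tau_{st}(x_s, x_t)}{\tilde{\tau}_s(x_s) \tilde{\tau}_t(x_t)} = \theta_{st}(x_s, x_t) - \lambda_{ts}(x_s) - \lambda_{st}(x_t) \tag{22.44}$$


where we have defined $\tilde{\tau}_s(x_s) \triangleq \sum_{x_t} \tau_{st}(x_s, x_t)$. Using the fact that the marginalization constraint
implies $\tilde{\tau}_s(x_s) = \tau_s(x_s)$, we get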


$$\log \tau_{st}(x_s, x_t) = \lambda_{ss} + \lambda_{tt} + \theta_{st}(x_s, x_t) + \theta_s(x_s) + \theta_t(x_t) + \sum_{u \in \mathrm{nbr}(s) \setminus t} \lambda_{us}(x_s) + \sum_{u \in \mathrm{nbr}(t) \setminus s} \lambda_{ut}(x_t) \tag{22.45}$$

To make the connection to message passing, define $M_{ts}(x_s) = \exp(\lambda_{ts}(x_s))$. With this
notation, we can rewrite the above equations (after taking exponents of both sides) as follows:

$$\tau_s(x_s) \propto \exp(\theta_s(x_s)) \prod_{t \in \mathrm{nbr}(s)} M_{ts}(x_s) \tag{22.46}$$

$$\tau_{st}(x_s, x_t) \propto \exp\left( \theta_{st}(x_s, x_t) + \theta_s(x_s) + \theta_t(x_t) \right) \times \prod_{u \in \mathrm{nbr}(s) \setminus t} M_{us}(x_s) \prod_{u \in \mathrm{nbr}(t) \setminus s} M_{ut}(x_t) \tag{22.47}$$

where the $\lambda$ terms are absorbed into the constant of proportionality. We see that this is
equivalent to the usual expression for the node and edge marginals in LBP.

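To derive an equation for the messages in terms of other messages (rather than in terms of
$\lambda_{ts}$), we enforce the marginalization condition $\sum_{x_t} \tau_{st}(x_s, x_t) = \tau_s(x_s)$. Then one can show
that


$$M_{ts}(x_s) \propto \sum_{x_t} \left[ \exp\{ \theta_{st}(x_s, x_t) + \theta_t(x_t) \} \prod_{u \in \mathrm{nbr}(t) \setminus s} M_{ut}(x_t) \right] \tag{22.48}$$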


We see that this is equivalent to the usual expression for the messages in LBP. 
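
The message update and the node belief formula translate directly into code. The sketch below uses a parallel update schedule with normalized messages; the function names (`loopy_bp`, `exact_node_marginals`), the data layout and the four-node chain used for checking are assumptions made for illustration. On a tree the beliefs should match brute-force marginals, which gives a simple sanity check.

```python
import itertools
import numpy as np

def loopy_bp(theta_s, theta_st, edges, n_iters=50):
    """Parallel sum-product updates for a pairwise UGM; returns node beliefs."""
    n, K = theta_s.shape
    nbrs = {s: set() for s in range(n)}
    pot = {}
    for (s, t) in edges:
        nbrs[s].add(t)
        nbrs[t].add(s)
        pot[(s, t)] = theta_st[(s, t)]    # indexed [x_s, x_t]
        pot[(t, s)] = theta_st[(s, t)].T  # indexed [x_t, x_s]
    # M[(t, s)] is the message from t to s, a function of x_s.
    M = {(t, s): np.ones(K) / K for s in range(n) for t in nbrs[s]}
    for _ in range(n_iters):
        new_M = {}
        for (t, s) in M:
            # exp(theta_t(x_t)) times all messages into t except the one from s
            incoming = np.exp(theta_s[t])
            for u in nbrs[t] - {s}:
                incoming = incoming * M[(u, t)]
            # sum over x_t of exp(theta_st(x_s, x_t)) * incoming(x_t)
            msg = np.exp(pot[(s, t)]) @ incoming
            new_M[(t, s)] = msg / msg.sum()
        M = new_M
    beliefs = np.exp(theta_s)
    for s in range(n):
        for t in nbrs[s]:
            beliefs[s] = beliefs[s] * M[(t, s)]
    return beliefs / beliefs.sum(axis=1, keepdims=True)

def exact_node_marginals(theta_s, theta_st, edges):
    """Brute-force node marginals, for checking on tiny models."""
    n, K = theta_s.shape
    marg = np.zeros((n, K))
    for x in itertools.product(range(K), repeat=n):
        score = sum(theta_s[s, x[s]] for s in range(n))
        score += sum(theta_st[(s, t)][x[s], x[t]] for (s, t) in edges)
        for s in range(n):
            marg[s, x[s]] += np.exp(score)
    return marg / marg.sum(axis=1, keepdims=True)

# Hypothetical check on a 4-node chain (a tree), where BP is exact.
rng = np.random.default_rng(1)
chain_edges = [(0, 1), (1, 2), (2, 3)]
node_theta = rng.normal(size=(4, 2))
edge_theta = {e: rng.normal(size=(2, 2)) for e in chain_edges}
bp_marg = loopy_bp(node_theta, edge_theta, chain_edges)
true_marg = exact_node_marginals(node_theta, edge_theta, chain_edges)
print(np.abs(bp_marg - true_marg).max())
```

Running the same function on a graph with cycles gives the loopy approximation; convergence is then not guaranteed, and damping may be needed in practice.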


Loopy BP vs mean field 


It is interesting to compare the naive mean field (MF) and LBP approximations. There are several 
obvious differences. First, LBP is exact for trees whereas MF is not, suggesting LBP will in general 
be more accurate (see (Wainwright et al. 2003) for an analysis). Second, LBP optimizes over node 
and edge marginals, whereas MF only optimizes over node marginals, again suggesting LBP will 
be more accurate. Third, in the case that the true edge marginals factorize, so $\mu_{st} = \mu_s \mu_t$, the
free energy approximations will be the same in both cases.

What is less obvious, but which nevertheless seems to be true, is that the MF objective has 
many more local optima than the LBP objective, so optimizing the MF objective seems to be 
harder. In particular, (Weiss 2001) shows empirically that optimizing MF starting from uniform
or random initial conditions often leads to poor results, whereas optimizing BP from uniform
initial messages often leads to good results. Furthermore, initializing MF with the BP marginals
also leads to good results (although MF tends to be more overconfident than BP), indicating that
the problem is caused not by the inaccuracy of the MF approximation, but rather by the severe
non-convexity of the MF objective, and by the weakness of the standard coordinate descent
optimization method used by MF.² However, the advantage of MF is that it gives a lower bound
on the partition function, unlike BP, which is useful when using it as a subroutine inside a 
learning algorithm. Also, MF is easier to extend to other distributions besides discrete and 
Gaussian, as we saw in Chapter 21. Intuitively, this is because MF only works with marginal 
distributions, which have a single type, rather than needing to define pairwise distributions, 
which may need to have two different types. 


Extensions of belief propagation * 


In this section, we discuss various extensions of LBP. 


Generalized belief propagation 


We can improve the accuracy of loopy BP by clustering together nodes that form a tight loop. 
This is known as the cluster variational method. The result is a hyper-graph, which is a graph 


2. (Honkela et al. 2003) discusses the use of the pattern search algorithm to speed up mean field inference in the case
of continuous random variables. It is possible that similar ideas could be adapted to the discrete case, although there 
may be no reason to do this, given that LBP already works well in the discrete case. 
Figure 22.9 (a) Kikuchi clusters superimposed on a 3 x 3 lattice graph. (b) Corresponding hyper-graph.
Source: Figure 4.5 of (Wainwright and Jordan 2008b). Used with kind permission of Martin Wainwright. 


where there are hyper-edges between sets of vertices instead of between single vertices. Note
that a junction tree (Section 20.4.1) is a kind of hyper-graph. We can represent a hyper-graph using
a poset (partially ordered set) diagram, where each node represents a hyper-edge, and there is
an arrow $e_1 \to e_2$ if $e_2 \subset e_1$. See Figure 22.9 for an example.

Let t be the size of the largest hyper-edge in the hyper-graph. If we allow t to be as large as 
the treewidth of the graph, then we can represent the hyper-graph as a tree, and the method 
will be exact, just as LBP is exact on regular trees (with treewidth 1). In this way, we can define 
a continuum of approximations, from LBP all the way to exact inference. 

Define $L_t(G)$ to be the set of all pseudo-marginals such that normalization and marginalization
constraints hold on a hyper-graph whose largest hyper-edge is of size $t + 1$. For example,
in Figure 22.9, we impose constraints of the form


$$\sum_{x_1, x_2} \tau_{1245}(x_1, x_2, x_4, x_5) = \tau_{45}(x_4, x_5), \quad \sum_{x_6} \tau_{56}(x_5, x_6) = \tau_5(x_5), \; \ldots \tag{22.49}$$


Furthermore, we approximate the entropy as follows:


$$H_{\mathrm{Kikuchi}}(\tau) \triangleq \sum_{g \in E} c(g) H_g(\tau_g) \tag{22.50}$$


where $H_g(\tau_g)$ is the entropy of the joint (pseudo) distribution on the vertices in set $g$, and $c(g)$
is called the overcounting number of set $g$. These are related to Möbius numbers in set
theory. Rather than giving a precise definition, we just give a simple example. For the graph in
Figure 22.9, we have


$$H_{\mathrm{Kikuchi}}(\tau) = [H_{1245} + H_{2356} + H_{4578} + H_{5689}] - [H_{25} + H_{45} + H_{56} + H_{58}] + H_5 \tag{22.51}$$
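
The coefficients in this example can be reproduced with a short recursion. The sketch below assumes the rule commonly used in the region-graph literature, $c(g) = 1 - \sum_{f \supset g} c(f)$ with the sum over all regions strictly containing $g$; this rule is not stated in the text above, and the function name `overcounting_numbers` is hypothetical.

```python
def overcounting_numbers(regions):
    """Return c(g) = 1 - sum of c(f) over regions f that strictly contain g."""
    regions = [frozenset(r) for r in regions]
    c = {}
    # Process larger regions first so every strict superset is already counted.
    for g in sorted(regions, key=len, reverse=True):
        c[g] = 1 - sum(c[f] for f in c if g < f)  # g < f tests strict subset
    return c

# Regions of the 3 x 3 lattice example: four 2 x 2 clusters and their overlaps.
clusters = [{1, 2, 4, 5}, {2, 3, 5, 6}, {4, 5, 7, 8}, {5, 6, 8, 9}]
overlaps = [{2, 5}, {4, 5}, {5, 6}, {5, 8}, {5}]
counts = overcounting_numbers(clusters + overlaps)
for region, value in counts.items():
    print(sorted(region), value)
```

The four clusters receive weight $+1$, the four pairwise overlaps $-1$, and the central node $+1$, matching the signs in the worked example.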


Putting these two approximations together, we can define the Kikuchi free energy³ as follows:

$$F_{\mathrm{Kikuchi}}(\tau) = -\left[ \theta^T \tau + H_{\mathrm{Kikuchi}}(\tau) \right] \tag{22.52}$$


3. Ryoichi Kikuchi is a Japanese physicist. 
Our variational problem becomes


$$\min_{\tau \in L_t(G)} F_{\mathrm{Kikuchi}}(\tau) = \max_{\tau \in L_t(G)} \theta^T \tau + H_{\mathrm{Kikuchi}}(\tau) \tag{22.53}$$

Just as with the Bethe free energy, this is not a concave objective. There are several possible 
algorithms for finding a local optimum of this objective, including a message passing algorithm 
known as generalized belief propagation. However, the details are beyond the scope of this 
chapter. See e.g., (Wainwright and Jordan 2008b, Sec 4.2) or (Koller and Friedman 2009, Sec 
11.3.2) for more information. Suffice it to say that the method gives more accurate results than 
LBP, but at increased computational cost (because of the need to handle clusters of nodes). This 
cost, plus the complexity of the approach, have precluded it from widespread use. 


Convex belief propagation 


The mean field energy functional is concave, but it is maximized over a non-convex inner 
approximation to the marginal polytope. The Bethe and Kikuchi energy functionals are not 
concave, but they are maximized over a convex outer approximation to the marginal polytope. 
Consequently, for both MF and LBP, the optimization problem has multiple optima, so the
methods are sensitive to the initial conditions. Given that the exact formulation (Equation 22.24)
is a concave objective maximized over a convex set, it is natural to try to come up with an
approximation which involves a concave objective being maximized over a convex set.

We now describe one method, known as convex belief propagation. This involves working
with a set of tractable submodels, $\mathcal{F}$, such as trees or planar graphs. For each model $F \subset G$,
the entropy is higher, $H(\mu(F)) \ge H(\mu(G))$, since $F$ has fewer constraints. Consequently, any
convex combination of such subgraphs will have higher entropy, too:


$$H(\mu(G)) \le \sum_{F \in \mathcal{F}} \rho(F) H(\mu(F)) \triangleq H(\mu, \rho) \tag{22.54}$$


where $\rho(F) \ge 0$ and $\sum_F \rho(F) = 1$. Furthermore, $H(\mu, \rho)$ is a concave function of $\mu$. We now
define the convex free energy as


$$F_{\mathrm{Convex}}(\mu, \rho) \triangleq -\left[ \mu^T \theta + H(\mu, \rho) \right] \tag{22.55}$$


We define the concave energy functional as the negative of the convex free energy. We discuss 
how to optimize p below. 

Having defined an upper bound on the entropy, we now consider a convex outer bound on
the marginal polytope of mean parameters. We want to ensure we can evaluate the entropy of
any vector $\tau$ in this set, so we restrict it so that the projection of $\tau$ onto each subgraph $F$ lives
in the projection of $M$ onto $F$:


$$L(G; \mathcal{F}) \triangleq \{ \tau \in \mathbb{R}^d : \tau(F) \in M(F) \;\; \forall F \in \mathcal{F} \} \tag{22.56}$$

This is a convex set since each $M(F)$ is a projection of a convex set. Hence we define our
problem as


$$\min_{\tau \in L(G; \mathcal{F})} F_{\mathrm{Convex}}(\tau, \rho) = \max_{\tau \in L(G; \mathcal{F})} \tau^T \theta + H(\tau, \rho) \tag{22.57}$$


This is a concave objective being maximized over a convex set, and hence has a unique maximum.
We give a specific example below.
Figure 22.10 (a) A graph. (b-d) Some of its spanning trees. Source: Figure 7.1 of (Wainwright and Jordan
2008b). Used with kind permission of Martin Wainwright. 


Tree-reweighted belief propagation 


Consider the specific case where $\mathcal{F}$ is all spanning trees of a graph. For any given tree, the
entropy is given by Equation 22.35. To compute the upper bound, obtained by averaging over
all trees, note that the terms $\sum_F \rho(F) H(\mu(F)_s)$ for single nodes will just be $H_s$, since node $s$
appears in every tree, and $\sum_F \rho(F) = 1$. But the mutual information term $I_{st}$ receives weight
$\rho_{st} = E_\rho[\mathbb{I}((s,t) \in E(T))]$, known as the edge appearance probability. Hence we have the
following upper bound on the entropy:


$$H(\mu) \le \sum_{s \in V} H_s(\mu_s) - \sum_{(s,t) \in E} \rho_{st} I_{st}(\mu_{st}) \tag{22.58}$$


The edge appearance probabilities live in a space called the spanning tree polytope. This
is because they are constrained to arise from a distribution over trees. Figure 22.10 gives an
example of a graph and three of its spanning trees. Suppose each tree has equal weight under
$\rho$. The edge $f$ occurs in 1 of the 3 trees, so $\rho_f = 1/3$. The edge $e$ occurs in 2 of the 3 trees,
so $\rho_e = 2/3$. The edge $b$ appears in all of the trees, so $\rho_b = 1$. And so on. Ideally we can
find a distribution $\rho$, or equivalently edge probabilities in the spanning tree polytope, that make
the above bound as tight as possible. An algorithm to do this is described in (Wainwright et al.
2005). (A simpler approach is to generate spanning trees of $G$ at random until all edges are
covered, or use all single edges with weight $\rho_e = 1/E$.)
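
For a small graph, edge appearance probabilities under a uniform distribution over all spanning trees can be computed by brute-force enumeration. The sketch below is illustrative only: the function names and the triangle graph are hypothetical (it is not the graph of Figure 22.10), and enumeration is feasible only for tiny graphs.

```python
import itertools

def is_spanning_tree(n_nodes, tree_edges):
    """A set of n_nodes - 1 edges is a spanning tree iff it has no cycle."""
    parent = list(range(n_nodes))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for (s, t) in tree_edges:
        rs, rt = find(s), find(t)
        if rs == rt:
            return False  # adding this edge would close a cycle
        parent[rs] = rt
    return True

def edge_appearance_probs(n_nodes, edges):
    """Fraction of spanning trees containing each edge (uniform weights)."""
    trees = [c for c in itertools.combinations(edges, n_nodes - 1)
             if is_spanning_tree(n_nodes, c)]
    probs = {e: sum(e in tr for tr in trees) / len(trees) for e in edges}
    return probs, trees

# Hypothetical example: a triangle has 3 spanning trees, each omitting one edge.
triangle = [(0, 1), (1, 2), (0, 2)]
rho, trees = edge_appearance_probs(3, triangle)
print(rho)
```

Since every spanning tree has $|V| - 1$ edges, the appearance probabilities always sum to $|V| - 1$, which is a convenient check on any proposed set of weights.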

What about the set we are optimizing over? We require $\mu(T) \in M(T)$ for each tree $T$, which
means enforcing normalization and local consistency. Since we have to do this for every tree,
we are enforcing normalization and local consistency on every edge. Hence $L(G; \mathcal{F}) = L(G)$.
So our final optimization problem is as follows:


$$\max_{\tau \in L(G)} \left\{ \tau^T \theta + \sum_{s \in V} H_s(\tau_s) - \sum_{(s,t) \in E(G)} \rho_{st} I_{st}(\tau_{st}) \right\} \tag{22.59}$$


which is the same as the LBP objective except for the crucial $\rho_{st}$ weights. So long as $\rho_{st} > 0$
for all edges $(s,t)$, this problem is strictly concave with a unique maximum.

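How can we find this global optimum? As for LBP, there are several algorithms, but perhaps the
simplest is a modification of belief propagation known as tree reweighted belief propagation,
also called TRW or TRBP for short. The message from $t$ to $s$ is now a function of all messages
sent from other neighbors $v$ to $t$, as before, but now it is also a function of the message sent
from $s$ to $t$. Specifically


$$M_{ts}(x_s) \propto \sum_{x_t} \exp\left( \frac{1}{\rho_{st}} \theta_{st}(x_s, x_t) + \theta_t(x_t) \right) \frac{\prod_{v \in \mathrm{nbr}(t) \setminus s} [M_{vt}(x_t)]^{\rho_{vt}}}{[M_{st}(x_t)]^{1 - \rho_{ts}}} \tag{22.60}$$


At convergence, the node and edge pseudo marginals are given by


$$\tau_s(x_s) \propto \exp(\theta_s(x_s)) \prod_{v \in \mathrm{nbr}(s)} [M_{vs}(x_s)]^{\rho_{vs}} \tag{22.61}$$

$$\tau_{st}(x_s, x_t) \propto \varphi_{st}(x_s, x_t) \frac{\prod_{v \in \mathrm{nbr}(s) \setminus t} [M_{vs}(x_s)]^{\rho_{vs}}}{[M_{ts}(x_s)]^{1 - \rho_{st}}} \frac{\prod_{v \in \mathrm{nbr}(t) \setminus s} [M_{vt}(x_t)]^{\rho_{vt}}}{[M_{st}(x_t)]^{1 - \rho_{ts}}} \tag{22.62}$$

$$\varphi_{st}(x_s, x_t) \triangleq \exp\left( \frac{1}{\rho_{st}} \theta_{st}(x_s, x_t) + \theta_s(x_s) + \theta_t(x_t) \right) \tag{22.63}$$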


This algorithm can be derived using a method similar to that described in Section 22.3.5.4. 

If $\rho_{st} = 1$ for all edges $(s,t) \in E$, the algorithm reduces to the standard LBP algorithm.
However, the condition $\rho_{st} = 1$ implies every edge is present in every spanning tree with
probability 1, which is only possible if the original graph is a tree. Hence the method is only
equivalent to standard LBP on trees, when the method is of course exact.

In general, this message passing scheme is not guaranteed to converge to the unique global 
optimum. One can devise double-loop methods that are guaranteed to converge (Hazan and 
Shashua 2008), but in practice, using damped updates as in Equation 22.7 is often sufficient to 
ensure convergence. 

It is also possible to produce a convex version of the Kikuchi free energy, which one can 
optimize with a modified version of generalized belief propagation. See (Wainwright and Jordan 
2008b, Sec 7.2.2) for details. 

From Equation 22.59, and using the fact that the TRBP entropy approximation is an upper
bound on the true entropy, we see that the TRBP objective is an upper bound on $\log Z$. Using
the fact that $I_{st} = H_s + H_t - H_{st}$, we can rewrite the upper bound as follows:


$$\log \hat{Z}(\theta) \triangleq \tau^T \theta + \sum_{st} \rho_{st} H_{st}(\tau_{st}) + \sum_s c_s H_s(\tau_s) \ge \log Z(\theta) \tag{22.64}$$

where $c_s \triangleq 1 - \sum_t \rho_{st}$.
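
The rewriting step can be spelled out as follows. This is a sketch that simply substitutes $I_{st} = H_s + H_t - H_{st}$ into the entropy terms of Equation 22.59 and collects the node entropies, with $\mathrm{nbr}(s)$ denoting the neighbors of $s$ as in the earlier equations:

$$\sum_{s} H_s - \sum_{(s,t) \in E} \rho_{st} (H_s + H_t - H_{st}) = \sum_{s} \Big( 1 - \sum_{t \in \mathrm{nbr}(s)} \rho_{st} \Big) H_s + \sum_{(s,t) \in E} \rho_{st} H_{st}$$

Each node entropy $H_s$ is subtracted once for every edge incident to $s$, weighted by that edge's appearance probability, which is where the coefficient $1 - \sum_t \rho_{st}$ comes from.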


Expectation propagation 


Expectation propagation (EP) (Minka 2001c) is a form of belief propagation where the messages
are approximated. It is a generalization of the assumed density filtering (ADF) algorithm,
discussed in Section 18.5.3. In that method, we approximated the posterior at each step using
an assumed functional form, such as a Gaussian. This posterior can be computed using moment
matching, which locally optimizes $KL(p \| q)$ for a single term. From this, we derived the
message to send to the next time step.




788 Chapter 22. More variational inference 


ADF works well for sequential Bayesian updating, but the answer it gives depends on the 
order in which the data is seen. EP essentially corrects this flaw by making multiple passes over 
the data (thus EP is an offline or batch inference algorithm). 


EP as a variational inference problem 


We now explain how to view EP in terms of variational inference. We follow the presentation of 
(Wainwright and Jordan 2008b, Sec 4.3), which should be consulted for further details. 
Suppose the joint distribution can be written in exponential family form as follows: 


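$$p(x | \theta, \tilde{\theta}) \propto f_0(x) \exp(\theta^T \phi(x)) \prod_{i=1}^{d_I} \exp(\tilde{\theta}_i^T \Phi_i(x)) \tag{22.65}$$


where we have partitioned the parameters and the sufficient statistics into a tractable term $\theta$ of
size $d_T$ and $d_I$ intractable terms $\tilde{\theta}_i$, each of size $b$.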

For example, consider the problem of inferring an unknown vector x, when the observation 
model is a mixture of two Gaussians, one centered at x and one centered at 0. (This can be 
used to represent outliers, for example.) Minka (who invented EP) calls this the clutter problem. 
More formally, we assume an observation model of the form 


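$$p(y | x) = (1 - w) \mathcal{N}(y | x, I) + w \mathcal{N}(y | 0, aI) \tag{22.66}$$


where $0 < w < 1$ is the known mixing weight (fraction of outliers), and $a > 0$ is the variance
of the background distribution. Assuming a fixed prior of the form $p(x) = \mathcal{N}(x | 0, \Sigma)$, we can
write our model in the required form as follows:


$$p(x | y_{1:N}) \propto \mathcal{N}(x | 0, \Sigma) \prod_{i=1}^N p(y_i | x) \tag{22.67}$$

$$= \exp\left( -\frac{1}{2} x^T \Sigma^{-1} x \right) \exp\left( \sum_{i=1}^N \log p(y_i | x) \right) \tag{22.68}$$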
i=1 


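This matches our canonical form where $f_0(x) \exp(\theta^T \phi(x))$ corresponds to $\exp\left( -\frac{1}{2} x^T \Sigma^{-1} x \right)$,
using $\phi(x) = (x, x x^T)$, and we set $\Phi_i(x) = \log p(y_i | x)$, $\tilde{\theta}_i = 1$, and $d_I = N$.
The exact inference problem corresponds to


$$\max_{(\tau, \tilde{\tau}) \in M(\phi, \Phi)} \tau^T \theta + \tilde{\tau}^T \tilde{\theta} + H((\tau, \tilde{\tau})) \tag{22.69}$$


where $M(\phi, \Phi)$ is the set of mean parameters realizable by any probability distribution as seen
through the eyes of the sufficient statistics:


$$M(\phi, \Phi) = \{ (\mu, \tilde{\mu}) \in \mathbb{R}^{d_T} \times \mathbb{R}^{d_I b} : (\mu, \tilde{\mu}) = E[(\phi(X), \Phi_1(X), \ldots, \Phi_{d_I}(X))] \} \tag{22.70}$$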


As it stands, it is intractable to perform inference in this distribution. For example, in our
clutter example, the posterior contains $2^N$ modes. But suppose we incorporate just one of the
intractable terms, say the $i$'th one; we will call this the $\Phi_i$-augmented distribution:


$$p(x | \theta, \tilde{\theta}_i) \propto f_0(x) \exp(\theta^T \phi(x)) \exp(\tilde{\theta}_i^T \Phi_i(x)) \tag{22.71}$$
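In our clutter example, this becomes


$$p(x | \theta, \tilde{\theta}_i) = \exp\left( -\frac{1}{2} x^T \Sigma^{-1} x \right) \left[ w \mathcal{N}(y_i | 0, aI) + (1 - w) \mathcal{N}(y_i | x, I) \right] \tag{22.72}$$

This is tractable to compute, since it is just a mixture of 2 Gaussians.
The key idea behind EP is to work with these $\Phi_i$-augmented distributions in an iterative
fashion. First, we approximate the convex set $M(\phi, \Phi)$ with another, larger convex set:


$$L(\phi, \Phi) \triangleq \{ (\tau, \tilde{\tau}) : \tau \in M(\phi), (\tau, \tilde{\tau}_i) \in M(\phi, \Phi_i) \} \tag{22.73}$$


where $M(\phi) = \{ \mu \in \mathbb{R}^{d_T} : \mu = E[\phi(X)] \}$ and $M(\phi, \Phi_i) = \{ (\mu, \tilde{\mu}_i) \in \mathbb{R}^{d_T} \times \mathbb{R}^b : (\mu, \tilde{\mu}_i) = E[(\phi(X), \Phi_i(X))] \}$. Next we approximate the entropy by the following term-by-term
approximation:


$$H_{\mathrm{ep}}(\tau, \tilde{\tau}) \triangleq H(\tau) + \sum_{i=1}^{d_I} \left[ H(\tau, \tilde{\tau}_i) - H(\tau) \right] \tag{22.74}$$


Then the EP problem becomes


$$\max_{(\tau, \tilde{\tau}) \in L(\phi, \Phi)} \tau^T \theta + \tilde{\tau}^T \tilde{\theta} + H_{\mathrm{ep}}(\tau, \tilde{\tau}) \tag{22.75}$$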
(T,#)EL(p, E) 


Optimizing the EP objective using moment matching 


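We now discuss how to maximize the EP objective in Equation 22.75. Let us duplicate $\tau$ $d_I$
times to yield $\eta_i = \tau$. The augmented set of parameters we need to optimize is now


$$(\tau, (\eta_i, \tilde{\tau}_i)_{i=1}^{d_I}) \in \mathbb{R}^{d_T} \times (\mathbb{R}^{d_T} \times \mathbb{R}^b)^{d_I} \tag{22.76}$$


subject to the constraints that $\eta_i = \tau$ and $(\eta_i, \tilde{\tau}_i) \in M(\phi; \Phi_i)$. Let us associate a vector of
Lagrange multipliers $\lambda_i \in \mathbb{R}^{d_T}$ with the first set of constraints. Then the partial Lagrangian
becomes


$$\mathcal{L}(\tau; \lambda) = \tau^T \theta + H(\tau) + \sum_{i=1}^{d_I} \left[ \tilde{\tau}_i^T \tilde{\theta}_i + H((\eta_i, \tilde{\tau}_i)) - H(\eta_i) + \lambda_i^T (\tau - \eta_i) \right] \tag{22.77}$$


By solving $\nabla_\tau \mathcal{L}(\tau; \lambda) = 0$, we can show that the corresponding distribution in $M(\phi)$ has
the form


$$q(x | \theta, \lambda) \propto f_0(x) \exp\left\{ \Big( \theta + \sum_{i=1}^{d_I} \lambda_i \Big)^T \phi(x) \right\} \tag{22.78}$$


The $\lambda_i^T \phi(x)$ terms represent an approximation to the $i$'th intractable term using the sufficient
statistics from the base distribution, as we will see below. Similarly, by solving $\nabla_{(\eta_i, \tilde{\tau}_i)} \mathcal{L}(\tau; \lambda) = 0$, we find that the corresponding distribution in $M(\phi, \Phi_i)$ has the form

$$q_i(x | \theta, \tilde{\theta}_i, \lambda) \propto f_0(x) \exp\left\{ \Big( \theta + \sum_{j \ne i} \lambda_j \Big)^T \phi(x) + \tilde{\theta}_i^T \Phi_i(x) \right\} \tag{22.79}$$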
This corresponds to removing the approximation to the $i$'th term, $\lambda_i$, from the base distribution,
and adding in the correct $i$'th term, $\Phi_i$. Finally, $\nabla_\lambda \mathcal{L}(\tau; \lambda) = 0$ just enforces the constraints
that $\tau = E_q[\phi(X)]$ and $\eta_i = E_{q_i}[\phi(X)]$ are equal. In other words, we get the following
moment matching constraints:


$$\int q(x | \theta, \lambda) \phi(x) dx = \int q_i(x | \theta, \tilde{\theta}_i, \lambda) \phi(x) dx \tag{22.80}$$


Thus the overall algorithm is as follows. First we initialize the $\lambda_i$. Then we iterate the following
to convergence: pick a term $i$; compute $q_i$ (corresponding to removing the old approximation
to $\Phi_i$ and adding in the new one); then update the $\lambda_i$ term in $q$ by solving the moment
matching equation $E_{q_i}[\phi(X)] = E_q[\phi(X)]$. (Note that this particular optimization scheme is
not guaranteed to converge to a fixed point.)

An equivalent way of stating the algorithm is as follows. Let us assume the true distribution 
is given by 


$$p(x | D) = \frac{1}{Z} \prod_i f_i(x) \tag{22.81}$$


We approximate each $f_i$ by $\tilde{f}_i$ and set

$$q(x) = \frac{1}{Z} \prod_i \tilde{f}_i(x) \tag{22.82}$$


Now we repeat the following until convergence:

1. Choose a factor $\tilde{f}_i$ to refine.

2. Remove $\tilde{f}_i$ from the posterior by dividing it out:

$$q_{-i}(x) = \frac{q(x)}{\tilde{f}_i(x)} \tag{22.83}$$

This can be implemented by subtracting off the natural parameters of $\tilde{f}_i$ from $q$.

3. Compute the new posterior $q^{\mathrm{new}}(x)$ by solving

$$\min_{q^{\mathrm{new}}(x)} KL\left( \frac{1}{Z_i} f_i(x) q_{-i}(x) \,\Big\|\, q^{\mathrm{new}}(x) \right) \tag{22.84}$$

This can be done by equating the moments of $q^{\mathrm{new}}(x)$ with those of $q_i(x) \propto q_{-i}(x) f_i(x)$.
The corresponding normalization constant has the form

$$Z_i = \int q_{-i}(x) f_i(x) dx \tag{22.85}$$

4. Compute the new factor (message) that was implicitly used (so it can be later removed):

$$\tilde{f}_i(x) = Z_i \frac{q^{\mathrm{new}}(x)}{q_{-i}(x)} \tag{22.86}$$

After convergence, we can approximate the marginal likelihood using

$$p(D) \approx \int \prod_i \tilde{f}_i(x) dx \tag{22.87}$$

We will give some examples of this below which will make things clearer.


EP for the clutter problem 


Let us return to considering the clutter problem. Our presentation is based on (Bishop 2006b).⁴
For simplicity, we will assume that the prior is a spherical Gaussian, $p(x) = \mathcal{N}(0, bI)$. Also, we
choose to approximate the posterior by a spherical Gaussian, $q(x) = \mathcal{N}(m, vI)$. We set $f_0(x)$
to be the prior; this can be held fixed. The factor approximations will be “Gaussian like” terms
of the form


$$\tilde{f}_i(x) = s_i \mathcal{N}(x | m_i, v_i I) \tag{22.88}$$


Note, however, that in the EP updates, the variances may be negative! Thus these terms should
be interpreted as functions, but not necessarily probability distributions. (If the variance is
negative, it means that $\tilde{f}_i$ curves upwards instead of downwards.)
First we remove $\tilde{f}_i(x)$ from $q(x)$ by division, which yields $q_{-i}(x) = \mathcal{N}(m_{-i}, v_{-i} I)$, where

$$v_{-i}^{-1} = v^{-1} - v_i^{-1} \tag{22.89}$$

$$m_{-i} = m + v_{-i} v_i^{-1} (m - m_i) \tag{22.90}$$

The normalization constant is given by

$$Z_i = (1 - w) \mathcal{N}(y_i | m_{-i}, (v_{-i} + 1) I) + w \mathcal{N}(y_i | 0, aI) \tag{22.91}$$

Next we compute $q^{\mathrm{new}}(x)$ by computing the mean and variance of $q_{-i}(x) f_i(x)$ as follows:


$$m = m_{-i} + \rho_i \frac{v_{-i}}{v_{-i} + 1} (y_i - m_{-i}) \tag{22.92}$$

$$v = v_{-i} - \rho_i \frac{v_{-i}^2}{v_{-i} + 1} + \rho_i (1 - \rho_i) \frac{v_{-i}^2 \| y_i - m_{-i} \|^2}{D (v_{-i} + 1)^2} \tag{22.93}$$

$$\rho_i = 1 - \frac{w}{Z_i} \mathcal{N}(y_i | 0, aI) \tag{22.94}$$


where $D$ is the dimensionality of $x$ and $\rho_i$ can be interpreted as the probability that $y_i$ is not
clutter.
Finally, we compute the new factor $\tilde{f}_i$ whose parameters are given by


$$v_i^{-1} = v^{-1} - v_{-i}^{-1} \tag{22.95}$$

$$m_i = m_{-i} + (v_i + v_{-i}) v_{-i}^{-1} (m - m_{-i}) \tag{22.96}$$

$$s_i = \frac{Z_i}{(2 \pi v_i)^{D/2} \mathcal{N}(m_i | m_{-i}, (v_i + v_{-i}) I)} \tag{22.97}$$


4. For a handy “crib sheet”, containing many of the standard equations needed for deriving Gaussian EP algorithms, see 
http://research.microsoft.com/en-us/um/people/minka/papers/ep/minka-ep-quickref.pdf.

At convergence, we can approximate the marginal likelihood as follows:


$$p(D) \approx (2 \pi v)^{D/2} \exp(c/2) \prod_{i=1}^N s_i (2 \pi v_i)^{-D/2} \tag{22.98}$$

$$c \triangleq \frac{m^T m}{v} - \sum_{i=1}^N \frac{m_i^T m_i}{v_i} \tag{22.99}$$

Ui 


In (Minka 2001d), it is shown that, at least on this example, EP gives better accuracy per unit 
of CPU time than VB and MCMC. 
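
The EP updates for the clutter problem can be sketched for the scalar case ($D = 1$) as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`normal_pdf`, `ep_clutter_1d`), the default hyperparameters, the choice to store each site in natural parameters (precision and precision times mean), and the rule of skipping a site whose cavity precision is not positive are all choices made here. The scale factors $s_i$ and the marginal likelihood are omitted.

```python
import numpy as np

def normal_pdf(y, mean, var):
    """Density of a scalar Gaussian N(y | mean, var)."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def ep_clutter_1d(y, w=0.2, a=10.0, b=100.0, n_sweeps=20):
    """EP for the 1d clutter model; returns the mean and variance of q(x).

    w: clutter fraction, a: clutter variance, b: prior variance.
    """
    n = len(y)
    site_prec = np.zeros(n)   # 1 / v_i, initialised so each site is flat
    site_pm = np.zeros(n)     # m_i / v_i
    prec, pm = 1.0 / b, 0.0   # q starts equal to the prior N(0, b)
    for _ in range(n_sweeps):
        for i in range(n):
            # Cavity distribution: divide site i out of q.
            cav_prec = prec - site_prec[i]
            if cav_prec <= 0:
                continue      # skip an ill-defined cavity (a pragmatic choice)
            cav_pm = pm - site_pm[i]
            v_c, m_c = 1.0 / cav_prec, cav_pm / cav_prec
            # Normaliser Z_i and probability rho_i that y_i is not clutter.
            clutter = w * normal_pdf(y[i], 0.0, a)
            Z = (1 - w) * normal_pdf(y[i], m_c, v_c + 1.0) + clutter
            rho = 1.0 - clutter / Z
            # Moment matching for the new q (D = 1).
            m_new = m_c + rho * v_c / (v_c + 1.0) * (y[i] - m_c)
            v_new = (v_c - rho * v_c ** 2 / (v_c + 1.0)
                     + rho * (1 - rho) * v_c ** 2 * (y[i] - m_c) ** 2
                     / (v_c + 1.0) ** 2)
            prec, pm = 1.0 / v_new, m_new / v_new
            # New site = new q divided by the cavity (precision may be negative).
            site_prec[i] = prec - cav_prec
            site_pm[i] = pm - cav_pm
    return pm / prec, 1.0 / prec

# Hypothetical data: eight observations near x = 2 plus two outliers.
y_obs = np.array([1.8, 2.3, 2.1, 1.6, 2.4, 1.9, 2.2, -4.0, 2.0, 5.5])
ep_mean, ep_var = ep_clutter_1d(y_obs)
print(ep_mean, ep_var)
```

Working in natural parameters means that removing a site is a subtraction and that a site with negative precision needs no special handling; only the cavity must have positive precision for the moment computations to make sense.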


LBP is a special case of EP 


We now show that loopy belief propagation is a special case of EP, where the base distribution 
contains the node marginals and the “intractable” terms correspond to the edge potentials. We 
assume the model has the pairwise form shown in Equation 22.12. If there are m nodes, the 
base distribution takes the form 


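$$p(x | \theta_1, \ldots, \theta_m, 0) \propto \prod_{s \in V} \exp(\theta_s(x_s)) \tag{22.100}$$


The entropy of this distribution is simply


$$H(\tau_{1:m}) = \sum_s H(\tau_s) \tag{22.101}$$


If we add in the $u - v$ edge, the $\Phi_{uv}$ augmented distribution has the form


$$p(x | \theta_{1:m}, \theta_{uv}) \propto \left[ \prod_{s \in V} \exp(\theta_s(x_s)) \right] \exp(\theta_{uv}(x_u, x_v)) \tag{22.102}$$


Since this graph is a tree, the exact entropy of this distribution is given by


$$H(\tau_{1:m}, \tilde{\tau}_{uv}) = \sum_s H(\tau_s) - I(\tilde{\tau}_{uv}) \tag{22.103}$$

where $I(\tau_{uv}) = H(\tau_u) + H(\tau_v) - H(\tau_{uv})$ is the mutual information. Thus the EP approximation
to the entropy of the full distribution is given by

$$H_{\mathrm{ep}}(\tau, \tilde{\tau}) = H(\tau) + \sum_{(u,v) \in E} \left[ H(\tau_{1:m}, \tilde{\tau}_{uv}) - H(\tau) \right] \tag{22.104}$$

$$= \sum_s H(\tau_s) + \sum_{(u,v) \in E} \left[ \sum_s H(\tau_s) - I(\tilde{\tau}_{uv}) - \sum_s H(\tau_s) \right] \tag{22.105}$$

$$= \sum_s H(\tau_s) - \sum_{(u,v) \in E} I(\tilde{\tau}_{uv}) \tag{22.106}$$


which is precisely the Bethe approximation to the entropy.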
We now show that the convex set that EP is optimizing over, $L(\phi, \Phi)$ given by Equation 22.73,
is the same as the one that LBP is optimizing over, $L(G)$ given in Equation 22.33. First, let us
consider the set $M(\phi)$. This consists of all marginal distributions $(\tau_s, s \in V)$, realizable by
a factored distribution. This is therefore equivalent to the set of all distributions which satisfy
non-negativity $\tau_s(x_s) \ge 0$ and the local normalization constraint $\sum_{x_s} \tau_s(x_s) = 1$. Now consider
the set $M(\phi, \Phi_{uv})$ for a single $u - v$ edge. This is equivalent to the marginal polytope $M(G_{uv})$,
where $G_{uv}$ is the graph with the single $u - v$ edge added. Since this graph corresponds to a
tree, this set also satisfies the marginalization conditions


$$\sum_{x_v} \tau_{uv}(x_u, x_v) = \tau_u(x_u), \quad \sum_{x_u} \tau_{uv}(x_u, x_v) = \tau_v(x_v) \tag{22.107}$$


Since $L(\phi, \Phi)$ is the union of such sets, as we sweep over all edges in the graph, we recover
the same set as $L(G)$.

We have shown that the Bethe approximation is equivalent to the EP approximation. We now 
show how the EP algorithm reduces to LBP. Associated with each intractable term i = (u,v) 
will be a pair of Lagrange multipliers, $(\lambda_{uv}(x_v), \lambda_{vu}(x_u))$. Recalling that $\theta^T \phi(x) = [\theta_s(x_s)]_s$,
the base distribution in Equation 22.78 has the form


$$q(x | \theta, \lambda) \propto \prod_s \exp(\theta_s(x_s)) \prod_{(u,v) \in E} \exp(\lambda_{uv}(x_v) + \lambda_{vu}(x_u)) \tag{22.108}$$

$$= \prod_s \exp\left( \theta_s(x_s) + \sum_{t \in N(s)} \lambda_{ts}(x_s) \right) \tag{22.109}$$


Similarly, the augmented distribution in Equation 22.79 has the form

$$q_{uv}(x | \theta, \lambda) \propto q(x | \theta, \lambda) \exp\left( \theta_{uv}(x_u, x_v) - \lambda_{uv}(x_v) - \lambda_{vu}(x_u) \right) \tag{22.110}$$

We now need to update $\tau_u(x_u)$ and $\tau_v(x_v)$ to enforce the moment matching constraints:


$$(E_q[x_s], E_q[x_t]) = (E_{q_{uv}}[x_s], E_{q_{uv}}[x_t]) \tag{22.111}$$


It can be shown that this can be done by performing the usual sum-product message passing
step along the $u - v$ edge (in both directions), where the messages are given by $M_{uv}(x_v) = \exp(\lambda_{uv}(x_v))$, and $M_{vu}(x_u) = \exp(\lambda_{vu}(x_u))$. Once we have updated $q$, we can derive the
corresponding messages $\lambda_{uv}$ and $\lambda_{vu}$.

The above analysis suggests a natural extension, where we make the base distribution be a 
tree structure instead of a fully factored distribution. We then add in one edge at a time, absorb 
its effect, and approximate the resulting distribution by a new tree. This is known as tree EP 
(Minka and Qi 2003), and is more accurate than LBP, and sometimes faster. By considering other
kinds of structured base distributions, we can derive algorithms that outperform generalized
belief propagation (Welling et al. 2005).


Ranking players using TrueSkill 


We now present an interesting application of EP to the problem of ranking players who compete 
in games. Microsoft uses this method — known as TrueSkill (Herbrich et al. 2007) — to rank 


794 Chapter 22. More variational inference 


Figure 22.11 (a) A DGM representing the TrueSkill model for 4 players and 3 teams, where team 1 is player 
1, team 2 is players 2 and 3, and team 3 is player 4. We assume there are two games, team 1 vs team 2, 
and team 2 vs team 3. Nodes with double circles are deterministic. (b) A factor graph representation of the 
model where we assume there are 3 players (and no teams). There are 2 games, player 1 vs player 2, and 
player 2 vs player 3. The numbers inside circles represent steps in the message passing algorithm. 
players who use the Xbox 360 Live online gaming system; this system processes over $10^5$ games 
per day, making this one of the largest applications of Bayesian statistics to date.$^5$ The same 
method can also be applied to other games, such as tennis or chess.$^6$ 

The basic idea is shown in Figure 22.11(a). We assume each player $i$ has a latent or true 
underlying skill level $s_i \in \mathbb{R}$. These skill levels can evolve over time according to a simple 
dynamical model, $p(s_i^t|s_i^{t-1}) = N(s_i^t|s_i^{t-1}, \gamma^2)$. In any given game, we define the performance 
of player $i$ to be $p_i$, which has the conditional distribution $p(p_i|s_i) = N(p_i|s_i, \beta^2)$. We then 
define the performance of a team to be the sum of the performance of its constituent players. 
For example, in Figure 22.11(a), we assume team 2 is composed of players 2 and 3, so we define 
$t_2 = p_2 + p_3$. Finally, we assume that the outcome of a game depends on the difference in 
performance levels of the two teams. For example, in Figure 22.11(a), we assume $y_1 = \mathrm{sign}(d_1)$, 
where $d_1 = t_1 - t_2$, and where $y_1 = +1$ means team 1 won, and $y_1 = -1$ means team 2 won. 
Thus the prior probability that team 1 wins is 

$$p(y_1 = +1|s) = \int p(d_1 > 0|t_1, t_2)\, p(t_1|s_1)\, p(t_2|s_2)\, dt_1\, dt_2 \tag{22.112}$$

where $t_1 \sim N(s_1, \beta^2)$ and $t_2 \sim N(s_2 + s_3, \beta^2)$.$^7$ 

To simplify the presentation of the algorithm, we will ignore the dynamical model and assume 
a common static factored Gaussian prior, $N(\mu_0, \sigma_0^2)$, on the skills. Also, we will assume that 
each team consists of 1 player, so $t_i = p_i$, and that there can be no ties. Finally, we will integrate 
out the performance variables $p_i$, and assume $\beta^2 = 1$, leading to a final model of the form 

\begin{align}
p(s) &= \prod_i N(s_i|\mu_0, \sigma_0^2) \tag{22.113} \\
p(d_g|s) &= N(d_g|s_{i_g} - s_{j_g}, 1) \tag{22.114} \\
p(y_g|d_g) &= \mathbb{I}(y_g = \mathrm{sign}(d_g)) \tag{22.115}
\end{align}

where $i_g$ is the first player of game $g$, and $j_g$ is the second player. This is represented in 
factor graph form in Figure 22.11(b). We have 3 kinds of factors: the prior factor, $f_i(s_i) = 
N(s_i|\mu_0, \sigma_0^2)$, the game factor, $h_g(s_{i_g}, s_{j_g}, d_g) = N(d_g|s_{i_g} - s_{j_g}, 1)$, and the outcome factor, 
$k_g(d_g, y_g) = \mathbb{I}(y_g = \mathrm{sign}(d_g))$. 

Since the likelihood term $p(y_g|d_g)$ is not conjugate to the Gaussian priors, we will have to 
perform approximate inference. Thus even when the graph is a tree, we will need to iterate. 
(If there were an additional game, say between player 1 and player 3, then the graph would no 
longer be a tree.) We will represent all messages and marginal beliefs by 1d Gaussians. We will 
use the notation $\mu$ and $v$ for the mean and variance (the moment parameters), and $\lambda = 1/v$ 
and $\eta = \lambda\mu$ for the precision and precision-adjusted mean (the natural parameters). 
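The two parameterizations can be illustrated with a small sketch. The class name `Gauss1D` and the numbers below are hypothetical choices made for illustration only. The point is that multiplying (or dividing) 1d Gaussian messages amounts to adding (or subtracting) their natural parameters, which is how the message updates in this section are carried out.

```python
from dataclasses import dataclass


@dataclass
class Gauss1D:
    """Unnormalized 1d Gaussian stored in natural parameters (illustrative helper)."""
    lam: float  # precision, lambda = 1 / v
    eta: float  # precision-adjusted mean, eta = lambda * mu

    @staticmethod
    def from_moments(mu, v):
        return Gauss1D(1.0 / v, mu / v)

    def moments(self):
        # Convert back to (mean, variance).
        return self.eta / self.lam, 1.0 / self.lam

    def __mul__(self, other):
        # Product of Gaussians: natural parameters add.
        return Gauss1D(self.lam + other.lam, self.eta + other.eta)

    def __truediv__(self, other):
        # Ratio of Gaussians: natural parameters subtract.
        return Gauss1D(self.lam - other.lam, self.eta - other.eta)


prior = Gauss1D.from_moments(mu=25.0, v=4.0)  # made-up numbers
msg = Gauss1D.from_moments(mu=30.0, v=4.0)
post = prior * msg
mu, v = post.moments()
print(mu, v)
```

Dividing `post` by `msg` recovers `prior`, which is the operation used later to form cavity messages.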


5. Naive Bayes classifiers, which are widely used in spam filters, are often described as the most common application 
of Bayesian methods. However, the parameters of such models are usually fit using non-Bayesian methods, such as 
penalized maximum likelihood. 

6. Our presentation of this algorithm is based in part on lecture notes by Carl Rasmussen and Joaquin Quiñonero-Candela, 
available at http://mlg.eng.cam.ac.uk/teaching/4f13/1112/lect13.pdf. 

7. Note that this is very similar to probit regression, discussed in Section 9.4, except the inputs are (the differences of) 
latent 1 dimensional factors. If we assume a logistic noise model instead of a Gaussian noise model, we recover the 
Bradley Terry model of ranking. 
We initialize by assuming that at iteration 0, the initial upward messages from factors $h_g$ to 
variables $s_i$ are uniform, i.e., 

$$m^0_{h_g \to s_{i_g}}(s_{i_g}) = 1, \quad \lambda^0_{h_g \to s_{i_g}} = 0, \quad \eta^0_{h_g \to s_{i_g}} = 0 \tag{22.116}$$

and similarly $m^0_{h_g \to s_{j_g}}(s_{j_g}) = 1$. The message passing algorithm consists of 6 steps per game, 
as illustrated in Figure 22.11(b). We give the details of these steps below. 


1. Compute the posterior over the skills variables: 


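\begin{align}
q^t(s_i) &= f(s_i) \prod_g m^{t-1}_{h_g \to s_i}(s_i) = N_c(s_i|\eta_i^t, \lambda_i^t) \tag{22.117} \\
\lambda_i^t &= \lambda_0 + \sum_g \lambda^{t-1}_{h_g \to s_i}, \quad \eta_i^t = \eta_0 + \sum_g \eta^{t-1}_{h_g \to s_i} \tag{22.118}
\end{align}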


2. Compute the message from the skills variables down to the game factor hy: 


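$$m^t_{s_{i_g} \to h_g}(s_{i_g}) = \frac{q^t(s_{i_g})}{m^t_{h_g \to s_{i_g}}(s_{i_g})}, \quad m^t_{s_{j_g} \to h_g}(s_{j_g}) = \frac{q^t(s_{j_g})}{m^t_{h_g \to s_{j_g}}(s_{j_g})} \tag{22.119}$$

where the division is implemented by subtracting the natural parameters as follows: 

$$\lambda^t_{s_{i_g} \to h_g} = \lambda^t_{s_{i_g}} - \lambda^t_{h_g \to s_{i_g}}, \quad \eta^t_{s_{i_g} \to h_g} = \eta^t_{s_{i_g}} - \eta^t_{h_g \to s_{i_g}} \tag{22.120}$$

and similarly for $s_{j_g}$. 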


3. Compute the message from the game factor hg down to the difference variable dy: 


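\begin{align}
m^t_{h_g \to d_g}(d_g) &= \int\!\!\int h_g(d_g, s_{i_g}, s_{j_g})\, m^t_{s_{i_g} \to h_g}(s_{i_g})\, m^t_{s_{j_g} \to h_g}(s_{j_g})\, ds_{i_g}\, ds_{j_g} \tag{22.121} \\
&= \int\!\!\int N(d_g|s_{i_g} - s_{j_g}, 1)\, N(s_{i_g}|\mu^t_{s_{i_g} \to h_g}, v^t_{s_{i_g} \to h_g}) \tag{22.122} \\
&\qquad N(s_{j_g}|\mu^t_{s_{j_g} \to h_g}, v^t_{s_{j_g} \to h_g})\, ds_{i_g}\, ds_{j_g} \tag{22.123} \\
&= N(d_g|\mu^t_{h_g \to d_g}, v^t_{h_g \to d_g}) \tag{22.124} \\
v^t_{h_g \to d_g} &= 1 + v^t_{s_{i_g} \to h_g} + v^t_{s_{j_g} \to h_g} \tag{22.125} \\
\mu^t_{h_g \to d_g} &= \mu^t_{s_{i_g} \to h_g} - \mu^t_{s_{j_g} \to h_g} \tag{22.126}
\end{align}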


4. Compute the posterior over the difference variables: 


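\begin{align}
q^t(d_g) &\propto m^t_{h_g \to d_g}(d_g)\, m_{k_g \to d_g}(d_g) \tag{22.127} \\
&= N(d_g|\mu^t_{h_g \to d_g}, v^t_{h_g \to d_g})\, \mathbb{I}(y_g = \mathrm{sign}(d_g)) \tag{22.128} \\
&\approx N(d_g|\mu^t_g, v^t_g) \tag{22.129}
\end{align}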
Figure 22.12 (a) $\Psi$ function. (b) $\Lambda$ function. Based on Figure 2 of (Herbrich et al. 2007). Figure generated 
by trueskillPlot. 


(Note that the upward message from the $k_g$ factor is constant.) We can find these parameters 
by moment matching as follows: 

\begin{align}
\mu^t_g &= y_g \mu^t_{h_g \to d_g} + \sigma^t_{h_g \to d_g} \Psi\left(\frac{y_g \mu^t_{h_g \to d_g}}{\sigma^t_{h_g \to d_g}}\right) \tag{22.130} \\
v^t_g &= v^t_{h_g \to d_g} \left[1 - \Lambda\left(\frac{y_g \mu^t_{h_g \to d_g}}{\sigma^t_{h_g \to d_g}}\right)\right] \tag{22.131} \\
\Psi(x) &\triangleq \frac{N(x|0,1)}{\Phi(x)} \tag{22.132} \\
\Lambda(x) &\triangleq \Psi(x)(\Psi(x) + x) \tag{22.133}
\end{align}

(The derivation of these equations is left as a modification to Exercise 11.15.) These functions 
are plotted in Figure 22.12. Let us try to understand these equations. Suppose $\mu^t_{h_g \to d_g}$ is a 
large positive number. That means we expect, based on the current estimate of the skills, 
that $d_g$ will be large and positive. Consequently, if we observe $y_g = +1$, we will not be 
surprised that $i_g$ is the winner, which is reflected in the fact that the update factor for the 
mean is small, $\Psi(y_g \mu^t_{h_g \to d_g}) \approx 0$. Similarly, the update factor for the variance is small, 
$\Lambda(y_g \mu^t_{h_g \to d_g}) \approx 0$. However, if we observe $y_g = -1$, then the update factor for the mean 
and variance becomes quite large. 


5. Compute the upward message from the difference variable to the game factor hg: 


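\begin{align}
m^t_{d_g \to h_g}(d_g) &= \frac{q^t(d_g)}{m^t_{h_g \to d_g}(d_g)} \tag{22.134} \\
\lambda^t_{d_g \to h_g} &= \lambda^t_g - \lambda^t_{h_g \to d_g}, \quad \eta^t_{d_g \to h_g} = \eta^t_g - \eta^t_{h_g \to d_g} \tag{22.135}
\end{align}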


6. Compute the upward messages from the game factor to the skill variables. Let us assume 
Figure 22.13 (a) A DAG representing a partial ordering of players. (b) Posterior mean plus/minus 1 standard 
deviation for the latent skills of each player based on 26 games. Figure generated by trueskillDemo. 


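that $i_g$ is the winner, and $j_g$ is the loser. Then we have 

\begin{align}
m^t_{h_g \to s_{i_g}}(s_{i_g}) &= \int\!\!\int h_g(d_g, s_{i_g}, s_{j_g})\, m^t_{d_g \to h_g}(d_g)\, m^t_{s_{j_g} \to h_g}(s_{j_g})\, dd_g\, ds_{j_g} \tag{22.136} \\
&= N(s_{i_g}|\mu^t_{h_g \to s_{i_g}}, v^t_{h_g \to s_{i_g}}) \tag{22.137} \\
v^t_{h_g \to s_{i_g}} &= 1 + v^t_{d_g \to h_g} + v^t_{s_{j_g} \to h_g} \tag{22.138} \\
\mu^t_{h_g \to s_{i_g}} &= \mu^t_{d_g \to h_g} + \mu^t_{s_{j_g} \to h_g} \tag{22.139}
\end{align}

And similarly 

\begin{align}
m^t_{h_g \to s_{j_g}}(s_{j_g}) &= \int\!\!\int h_g(d_g, s_{i_g}, s_{j_g})\, m^t_{d_g \to h_g}(d_g)\, m^t_{s_{i_g} \to h_g}(s_{i_g})\, dd_g\, ds_{i_g} \tag{22.140} \\
&= N(s_{j_g}|\mu^t_{h_g \to s_{j_g}}, v^t_{h_g \to s_{j_g}}) \tag{22.141} \\
v^t_{h_g \to s_{j_g}} &= 1 + v^t_{d_g \to h_g} + v^t_{s_{i_g} \to h_g} \tag{22.142} \\
\mu^t_{h_g \to s_{j_g}} &= \mu^t_{s_{i_g} \to h_g} - \mu^t_{d_g \to h_g} \tag{22.143}
\end{align}

(The sign in Equation 22.143 follows from $d_g = s_{i_g} - s_{j_g}$ plus unit-variance noise, so that $s_{j_g} = s_{i_g} - d_g$ up to noise.) 
When we compute $q^{t+1}(s_{i_g})$ at the next iteration, by combining $m^t_{h_g \to s_{i_g}}(s_{i_g})$ with the 
prior factor, we will see that the posterior mean of $s_{i_g}$ goes up. Similarly, the posterior mean 
of $s_{j_g}$ goes down. 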


It is straightforward to combine EP with ADF to perform online inference, which is necessary 
for most practical applications. 
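The six steps can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the trueskillDemo implementation: the function names, the parallel update schedule, the prior values and the list of games are all hypothetical. Each game is given as a (winner, loser) pair, so $y_g = +1$ throughout, and the loser's upward mean is computed as $\mu_{s_{i_g} \to h_g} - \mu_{d_g \to h_g}$, which is the sign implied by $d_g = s_{i_g} - s_{j_g}$.

```python
import math


def psi(x):
    """Psi(x) = N(x|0,1) / Phi(x), Equation 22.132."""
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * math.erfc(-x / math.sqrt(2.0))
    return pdf / cdf


def lam_fn(x):
    """Lambda(x) = Psi(x) (Psi(x) + x), Equation 22.133."""
    p = psi(x)
    return p * (p + x)


def skill_posterior(n_players, games, up, lam0, eta0):
    """Step 1: add the natural parameters of the prior and all upward messages."""
    lam, eta = [lam0] * n_players, [eta0] * n_players
    for g, players in enumerate(games):
        for k, p in enumerate(players):
            lam[p] += up[g][k][0]
            eta[p] += up[g][k][1]
    return lam, eta


def trueskill_ep(n_players, games, mu0=0.0, v0=1.0, n_iters=5):
    """games: list of (winner, loser) index pairs. Returns posterior means, variances."""
    lam0, eta0 = 1.0 / v0, mu0 / v0
    # Uniform initial messages h_g -> s, stored as (lambda, eta) = (0, 0).
    up = [[(0.0, 0.0), (0.0, 0.0)] for _ in games]
    for _ in range(n_iters):
        lam, eta = skill_posterior(n_players, games, up, lam0, eta0)
        new_up = []
        for g, (i, j) in enumerate(games):
            # Step 2: cavity messages s -> h_g, converted to (mean, variance).
            cav = []
            for k, p in enumerate((i, j)):
                lam_c = lam[p] - up[g][k][0]
                eta_c = eta[p] - up[g][k][1]
                cav.append((eta_c / lam_c, 1.0 / lam_c))
            (mu_i, v_i), (mu_j, v_j) = cav
            # Step 3: message h_g -> d_g.
            v_hd, mu_hd = 1.0 + v_i + v_j, mu_i - mu_j
            # Step 4: moment matching of the truncated Gaussian (y_g = +1).
            sd = math.sqrt(v_hd)
            mu_g = mu_hd + sd * psi(mu_hd / sd)
            v_g = v_hd * (1.0 - lam_fn(mu_hd / sd))
            # Step 5: message d_g -> h_g (guard against a vanishing precision).
            lam_dh = max(1.0 / v_g - 1.0 / v_hd, 1e-12)
            eta_dh = mu_g / v_g - mu_hd / v_hd
            mu_dh, v_dh = eta_dh / lam_dh, 1.0 / lam_dh
            # Step 6: upward messages to the winner i and the loser j.
            v_ui, mu_ui = 1.0 + v_dh + v_j, mu_dh + mu_j
            v_uj, mu_uj = 1.0 + v_dh + v_i, mu_i - mu_dh
            new_up.append([(1.0 / v_ui, mu_ui / v_ui), (1.0 / v_uj, mu_uj / v_uj)])
        up = new_up
    lam, eta = skill_posterior(n_players, games, up, lam0, eta0)
    return [e / l for e, l in zip(eta, lam)], [1.0 / l for l in lam]


# Hypothetical data: lower-numbered players always beat higher-numbered ones.
games = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2), (1, 3), (2, 4)]
mu, var = trueskill_ep(5, games)
for p in range(5):
    print(p, round(mu[p], 3), round(var[p], 3))
```

All messages within an iteration are computed from the previous iteration's upward messages; a sequential schedule that reuses fresh messages immediately would be an equally reasonable design.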


Let us consider a simple example of this method. We create a partial ordering of 5 players 
as shown in Figure 22.13(a). We then sample some game outcomes from this graph, where a 
parent always beats a child. We pass this data into (5 iterations of) the EP algorithm and infer the 
posterior mean and variance for each player's skill level. The results are shown in Figure 22.13(b). 
We see that the method has correctly inferred the rank ordering of the players. 


22.5.6 Other applications of EP 


The TrueSkill model was developed by researchers at Microsoft. They and others have extended 
the model to a variety of other interesting applications, including personalized ad recommenda- 
tion (Stern et al. 2009), predicting click-through-rate on ads in the Bing search engine (Graepel 
et al. 2010), etc. They have also developed a general purpose Bayesian inference toolbox based 
on EP called infer.net (Minka et al. 2010). 

EP has also been used for a variety of other models, such as Gaussian process classification 
(Nickisch and Rasmussen 2008). See http://research.microsoft.com/en-us/um/people/minka/papers/ep/roadmap.html 
for a list of other EP applications. 


22.6 MAP state estimation 


In this section, we consider the problem of finding the most probable configuration of variables 
in a discrete-state graphical model, i.e., our goal is to find a MAP assignment of the following 
form: 


$$x^* = \arg\max_{x \in \mathcal{X}^m} p(x|\theta) = \arg\max_{x \in \mathcal{X}^m} \sum_{i \in V} \theta_i(x_i) + \sum_{f \in F} \theta_f(x_f) = \arg\max_{x \in \mathcal{X}^m} \theta^T \phi(x) \tag{22.144}$$

where $\theta_i$ are the singleton node potentials, and $\theta_f$ are the factor potentials. (In this section, we 
follow the notation of (Sontag et al. 2011), which considers the case of general potentials, not just 
pairwise ones.) Note that the partition function $Z(\theta)$ plays no role in MAP estimation. 

If the treewidth is low, we can solve this problem with the junction tree algorithm (Sec- 
tion 20.4), but in general this problem is intractable. In this section, we discuss various approxi- 
mations, building on the material from Section 22.3. 


Linear programming relaxation 


We can rewrite the objective in terms of the variational parameters as follows: 


$$\arg\max_{x \in \mathcal{X}^m} \theta^T \phi(x) = \arg\max_{\mu \in \mathcal{M}(G)} \theta^T \mu \tag{22.145}$$

where $\phi(x) = [\{\mathbb{I}(x_s = j)\}, \{\mathbb{I}(x_f = k)\}]$ and $\mu$ is a probability vector in the marginal 
polytope. To see why this equation is true, note that we can just set $\mu$ to be a degenerate 
distribution with $\mu(x_s) = \mathbb{I}(x_s = x_s^*)$, where $x_s^*$ is the optimal assignment of node $s$. So instead 
of optimizing over discrete assignments, we now optimize over probability distributions $\mu$. 

It seems like we have an easy problem to solve, since the objective in Equation 22.145 is linear 
in $\mu$, and the constraint set $\mathcal{M}(G)$ is convex. The trouble is, $\mathcal{M}(G)$ in general has a number of 
facets that is exponential in the number of nodes. 

A standard strategy in combinatorial optimization is to relax the constraints. In this case, 
instead of requiring probability vector $\mu$ to live in the marginal polytope $\mathcal{M}(G)$, we allow it to 
live inside a convex outer bound $\mathcal{L}(G)$. Having defined this relaxed constraint set, we have 

$$\max_{x \in \mathcal{X}^m} \theta^T \phi(x) = \max_{\mu \in \mathcal{M}(G)} \theta^T \mu \le \max_{\tau \in \mathcal{L}(G)} \theta^T \tau \tag{22.146}$$

If the solution is integral, it is exact; if it is fractional, it is an approximation. This is called 
a (first order) linear programming relaxation. The reason it is called first-order is that the 
constraints that are enforced are those that correspond to consistency on a tree, which is a 
graph of treewidth 1. It is possible to enforce higher-order consistency, using graphs with larger 
treewidth (see (Wainwright and Jordan 2008b, sec 8.5) for details). 

How should we actually perform the optimization? We can use a generic linear programming 
package, but this is often very slow. Fortunately, in the case of graphical models, it is possible to 
devise specialised distributed message passing algorithms for solving this optimization problem, 
as we explain below. 
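To make the inequality in Equation 22.146 concrete, here is a small sketch on a hypothetical 3-node binary cycle in which every edge rewards disagreement (the potentials and all names are assumptions made for illustration). No integral assignment can make all three edges disagree, but a fractional point satisfying only the local consistency constraints can, so the relaxed objective exceeds the true MAP value.

```python
from itertools import product

edges = [(0, 1), (1, 2), (0, 2)]  # a frustrated cycle (hypothetical example)


def theta_edge(a, b):
    """Edge potential that rewards disagreement; node potentials are zero."""
    return 1.0 if a != b else 0.0


# Exact MAP value by enumerating all 2^3 assignments.
map_value = max(
    sum(theta_edge(x[s], x[t]) for s, t in edges)
    for x in product((0, 1), repeat=3)
)

# A fractional point tau: uniform node marginals, edge mass only on disagreements.
tau_node = {s: (0.5, 0.5) for s in range(3)}
tau_edge = {e: {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.0} for e in edges}


def locally_consistent(tau_node, tau_edge, tol=1e-12):
    """Check normalization and the pairwise marginalization constraints of L(G)."""
    for p in tau_node.values():
        if abs(sum(p) - 1.0) > tol:
            return False
    for (s, t), table in tau_edge.items():
        for a in (0, 1):
            if abs(sum(table[(a, b)] for b in (0, 1)) - tau_node[s][a]) > tol:
                return False
            if abs(sum(table[(b, a)] for b in (0, 1)) - tau_node[t][a]) > tol:
                return False
    return True


# Relaxed objective theta^T tau at this fractional point.
lp_value = sum(
    table[ab] * theta_edge(*ab) for table in tau_edge.values() for ab in table
)

print(map_value, lp_value, locally_consistent(tau_node, tau_edge))
```

Here the fractional point lies in the outer bound but not in the marginal polytope, which is why its objective value is not attainable by any assignment.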


Max-product belief propagation 


The MAP objective in Equation 22.145, $\max_{\mu \in \mathcal{M}(G)} \theta^T \mu$, is almost identical to the inference 
objective in Equation 22.23, $\max_{\mu \in \mathcal{M}(G)} \theta^T \mu + H(\mu)$, apart from the entropy term. One 
heuristic way to proceed would be to consider the zero temperature limit of the probability 
distribution yz, where the probability distribution has all its mass centered on its mode (see 
Section 4.2.2). In such a setting, the entropy term becomes zero. We can then modify the 
message passing methods used to solve the inference problem so that they solve the MAP 
estimation problem instead. In particular, in the zero temperature limit, the sum operator 
becomes the max operator, which results in a method called max-product belief propagation. 
In more detail, let 


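$$A(\theta) = \max_{\mu \in \mathcal{M}(G)} \theta^T \mu + H(\mu) \tag{22.147}$$

Now consider an inverse temperature $\beta$ going to infinity. We have 

\begin{align}
\lim_{\beta \to +\infty} \frac{A(\beta\theta)}{\beta} &= \lim_{\beta \to +\infty} \frac{1}{\beta} \max_{\mu \in \mathcal{M}(G)} \left\{ (\beta\theta)^T \mu + H(\mu) \right\} \tag{22.148} \\
&= \max_{\mu \in \mathcal{M}(G)} \left\{ \theta^T \mu + \lim_{\beta \to +\infty} \frac{1}{\beta} H(\mu) \right\} \tag{22.149} \\
&= \max_{\mu \in \mathcal{M}(G)} \theta^T \mu \tag{22.150}
\end{align}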


It is the concavity of the objective function that allows us to interchange the lim and max 
operators (see (Wainwright and Jordan 2008b, p274) for details). 
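The zero temperature limit can be checked numerically on a tiny model, where $A(\theta)$ is just the log partition function and can be computed by enumeration. The potentials below are hypothetical. Since the log-sum-exp of $n$ terms lies between the maximum and the maximum plus $\log n$, the ratio $A(\beta\theta)/\beta$ is squeezed towards the MAP score as $\beta$ grows.

```python
import math
from itertools import product

theta_node = [[0.0, 0.5], [0.2, 0.0], [0.0, 0.1]]  # hypothetical node potentials
edges = [(0, 1), (1, 2)]
agree_bonus = 1.0  # hypothetical edge potential, paid when neighbors agree


def score(x):
    """theta^T phi(x) for a 3-node binary chain."""
    s = sum(theta_node[i][xi] for i, xi in enumerate(x))
    s += sum(agree_bonus for u, v in edges if x[u] == x[v])
    return s


states = list(product((0, 1), repeat=3))
map_score = max(score(x) for x in states)


def log_partition(beta):
    """A(beta * theta) computed by enumeration with a stable log-sum-exp."""
    scaled = [beta * score(x) for x in states]
    m = max(scaled)
    return m + math.log(sum(math.exp(s - m) for s in scaled))


betas = (1.0, 10.0, 100.0, 1000.0)
ratios = [log_partition(b) / b for b in betas]
for b, r in zip(betas, ratios):
    print(b, r, r - map_score)  # the gap shrinks at rate at most log(8) / beta
```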

Now consider the Bethe approximation, which has the form $\max_{\tau \in \mathcal{L}(G)} \theta^T \tau + H_{\mathrm{Bethe}}(\tau)$. 
We showed that loopy BP finds a local optimum of this objective. In the zero temperature limit, 
this objective is equivalent to the LP relaxation of the MAP problem. Unfortunately, max-product 
loopy BP does not solve this LP relaxation unless the graph is a tree (Wainwright and Jordan 
2008b, p211). The reason is that Bethe energy functional is not concave (except on trees), so we 
are not licensed to swap the limit and max operators in the above zero-temperature derivation. 
However, if we use tree-reweighted BP, or TRBP/ TRW, we have a concave objective. In this case, 
one can show (Kolmogorov and Wainwright 2005) that the max-product version of TRBP does 
solve the above LP relaxation. 

A certain scheduling of this algorithm, known as sequential TRBP, TRBP-S, or TRW-S, can 
be shown to always converge (Kolmogorov 2006), and furthermore, it typically does so faster 
than the standard parallel updates. The idea is to pick an arbitrary node ordering $X_1, \ldots, X_N$. 
We then consider a set of trees which is a subsequence of this ordering. At each iteration, we 
perform max-product BP from $X_1$ towards $X_N$ and back along one of these trees. It can be 
shown that this monotonically increases a lower bound on the energy, and thus is guaranteed 
to converge to the global optimum of the LP relaxation. 
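As a reminder of what a single max-product pass looks like in the case where it is exact, here is a sketch of max-product on a chain (a tree), written in the log domain so that products become sums. This is plain max-product, not TRW-S; the function name and the potentials are hypothetical. The result is compared against brute-force enumeration.

```python
from itertools import product


def max_product_chain(theta_node, theta_edge):
    """Forward max pass plus backtracking on a chain; returns (best score, assignment)."""
    n, K = len(theta_node), len(theta_node[0])
    # msg[i][k]: best score of x_0..x_{i-1} (and the edge into i) given x_i = k.
    msg = [[0.0] * K for _ in range(n)]
    back = [[0] * K for _ in range(n)]
    for i in range(1, n):
        for k in range(K):
            cands = [
                msg[i - 1][j] + theta_node[i - 1][j] + theta_edge(j, k)
                for j in range(K)
            ]
            back[i][k] = max(range(K), key=cands.__getitem__)
            msg[i][k] = cands[back[i][k]]
    last = max(range(K), key=lambda k: msg[n - 1][k] + theta_node[n - 1][k])
    best = msg[n - 1][last] + theta_node[n - 1][last]
    # Backtrack from the last node to recover the maximizing assignment.
    x = [last]
    for i in range(n - 1, 0, -1):
        x.append(back[i][x[-1]])
    x.reverse()
    return best, x


# Hypothetical 4-node, 3-state chain with a smoothness-encouraging edge potential.
theta_node = [[((i * 7 + k * 3) % 5) / 2.0 for k in range(3)] for i in range(4)]


def theta_edge(a, b):
    return -0.8 * abs(a - b)


def score(x):
    return (sum(theta_node[i][xi] for i, xi in enumerate(x))
            + sum(theta_edge(x[i], x[i + 1]) for i in range(len(x) - 1)))


best, x_map = max_product_chain(theta_node, theta_edge)
brute = max(score(x) for x in product(range(3), repeat=4))
print(best, x_map, brute)
```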


Graphcuts 


In this section, we show how to find MAP state estimates, or equivalently, minimum energy 
configurations, by using the max flow/min cut algorithm for graphs. This class of methods is 
known as graphcuts and is very widely used, especially in computer vision applications. 

We will start by considering the case of MRFs with binary nodes and a restricted class of 
potentials; in this case, graphcuts will find the exact global optimum. We then consider the 
case of multiple states per node, which are assumed to have some underlying ordering; we can 
approximately solve this case by solving a series of binary subproblems, as we will see. 


Graphcuts for the generalized Ising model 
Let us start by considering a binary MRF where the edge energies have the following form: 

$$E_{u,v}(x_u, x_v) = \begin{cases} 0 & \text{if } x_u = x_v \\ \lambda_{u,v} & \text{if } x_u \ne x_v \end{cases} \tag{22.151}$$

where $\lambda_{u,v} \ge 0$ is the edge cost. This encourages neighboring nodes to have the same value 
(since we are trying to minimize energy). Since we are free to add any constant we like to the 
overall energy without affecting the MAP state estimate, let us rescale the local energy terms 
such that either $E_u(1) = 0$ or $E_u(0) = 0$. 

Now let us construct a graph which has the same set of nodes as the MRF, plus two distinguished 
nodes: the source $s$ and the sink $t$. If $E_u(1) = 0$, we add the edge $x_u \to t$ with cost 
$E_u(0)$. (This ensures that if $u$ is not in partition $\mathcal{X}_t$, meaning $u$ is assigned to state 0, we will 
pay a cost of $E_u(0)$ in the cut.) Similarly, if $E_u(0) = 0$, we add the edge $x_u \to s$ with cost 
$E_u(1)$. Finally, for every pair of variables that are connected in the MRF, we add edges $x_u \to x_v$ 
and $x_v \to x_u$, both with cost $\lambda_{u,v} \ge 0$. Figure 22.14 illustrates this construction for an MRF 
with 4 nodes, and with the following non-zero energy values: 

\begin{align}
& E_1(0) = 7, \quad E_2(1) = 2, \quad E_3(1) = 1, \quad E_4(1) = 6 \tag{22.152} \\
& \lambda_{1,2} = 6, \quad \lambda_{2,3} = 6, \quad \lambda_{3,4} = 2, \quad \lambda_{1,4} = 1 \tag{22.153}
\end{align}

Having constructed the graph, we compute a minimal $s - t$ cut. This is a partition of the nodes 
into two sets, $\mathcal{X}_s$, which are nodes connected to $s$, and $\mathcal{X}_t$, which are nodes connected to $t$. We 


8. There are a variety of ways to implement this algorithm, see e.g., (Sedgewick and Wayne 2011). The best take 
$O(EV \log V)$ or $O(V^3)$ time, where $E$ is the number of edges and $V$ is the number of nodes. 
Figure 22.14 Illustration of graphcuts applied to an MRF with 4 nodes. Dashed lines are ones which 
contribute to the cost of the cut (for bidirected edges, we only count one of the costs). Here the min cut 
has cost 6. Source: Figure 13.5 from (Koller and Friedman 2009). Used with kind permission of Daphne 
Koller. 


pick the partition which minimizes the sum of the cost of the edges between nodes on different 
sides of the partition: 


$$\mathrm{cost}(\mathcal{X}_s, \mathcal{X}_t) = \sum_{x_u \in \mathcal{X}_s,\, x_v \in \mathcal{X}_t} \mathrm{cost}(x_u, x_v) \tag{22.154}$$


In Figure 22.14, we see that the min-cut has cost 6. 

Minimizing the cost in this graph is equivalent to minimizing the energy in the MRF. Hence 
nodes that are assigned to s have an optimal state of 0, and the nodes that are assigned to t 
have an optimal state of 1. In Figure 22.14, we see that the optimal MAP estimate is (1, 1, 1, 0). 
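The 4-node example can be reproduced with a short sketch. The max-flow routine below is a plain Edmonds-Karp implementation chosen for brevity, not necessarily what one would use in practice, and all names are hypothetical. For a directed flow network we orient the terminal edges as $s \to x_u$ (capacity $E_u(1)$, cut when $u$ ends up on the $t$ side) and $x_u \to t$ (capacity $E_u(0)$, cut when $u$ ends up on the $s$ side); this orientation is an assumption about how to realize the construction above.

```python
from collections import deque


def min_cut(cap, s, t):
    """Edmonds-Karp max flow. Returns (cut value, set of nodes on the s side)."""
    res = {}  # residual capacities, including zero-capacity reverse edges
    for (u, v), c in cap.items():
        res.setdefault(u, {})
        res.setdefault(v, {})
        res[u][v] = res[u].get(v, 0.0) + c
        res[v].setdefault(u, 0.0)
    flow = 0.0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No path left: the nodes still reachable from s form the s side.
            return flow, set(parent)
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        flow += bottleneck


# Energies from Equations 22.152 and 22.153, stored as (E_u(0), E_u(1)).
unary = {1: (7, 0), 2: (0, 2), 3: (0, 1), 4: (0, 6)}
pair = {(1, 2): 6, (2, 3): 6, (3, 4): 2, (1, 4): 1}

cap = {}
for u, (e0, e1) in unary.items():
    if e1 > 0:
        cap[("s", u)] = e1  # paid if u is assigned state 1 (t side)
    if e0 > 0:
        cap[(u, "t")] = e0  # paid if u is assigned state 0 (s side)
for (u, v), lam in pair.items():
    cap[(u, v)] = lam
    cap[(v, u)] = lam

value, s_side = min_cut(cap, "s", "t")
labels = tuple(0 if u in s_side else 1 for u in sorted(unary))
print(value, labels)
```

The cut value and labeling agree with the cost of 6 and the MAP estimate (1, 1, 1, 0) quoted for Figure 22.14.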


Graphcuts for binary MRFs with submodular potentials 


We now discuss how to extend the graphcuts construction to binary MRFs with more general 
kinds of potential functions. In particular, suppose each pairwise energy satisfies the following 
condition: 


$$E_{uv}(1,1) + E_{uv}(0,0) \le E_{uv}(1,0) + E_{uv}(0,1) \tag{22.155}$$

In other words, the sum of the diagonal energies is no greater than the sum of the off-diagonal energies. 
In this case, we say the energies are submodular (Kolmogorov and Zabih 2004).$^9$ An example 
of a submodular energy is an Ising model where $\lambda_{uv} > 0$. This is also known as an attractive 
MRF or associative MRF, since the model "wants" neighboring states to be the same. 


9. Submodularity is the discrete analog of convexity. Intuitively, it corresponds to the "law of diminishing returns", that 
is, the extra value of adding one more element to a set is reduced if the set is already large. More formally, we say that 
$f : 2^S \to \mathbb{R}$ is submodular if for any $A \subset B \subset S$ and $x \in S$, we have $f(A \cup \{x\}) - f(A) \ge f(B \cup \{x\}) - f(B)$. 
If $-f$ is submodular, then $f$ is supermodular. 
To apply graphcuts to a binary MRF with submodular potentials, we construct the pairwise 
edge weights as follows: 

$$E'_{u,v}(0,1) = E_{u,v}(1,0) + E_{u,v}(0,1) - E_{u,v}(0,0) - E_{u,v}(1,1) \tag{22.156}$$

This is guaranteed to be non-negative by virtue of the submodularity assumption. In addition, 
we construct new local edge weights as follows: first we initialize $E'(u) = E(u)$, and then for 
each edge pair $(u,v)$, we update these values as follows: 

\begin{align}
E'_u(1) &= E'_u(1) + (E_{u,v}(1,0) - E_{u,v}(0,0)) \tag{22.157} \\
E'_v(1) &= E'_v(1) + (E_{u,v}(1,1) - E_{u,v}(1,0)) \tag{22.158}
\end{align}

We now construct a graph in a similar way to before. Specifically, if $E'_u(1) > E'_u(0)$, we 
add the edge $u \to s$ with cost $E'_u(1) - E'_u(0)$, otherwise we add the edge $u \to t$ with cost 
$E'_u(0) - E'_u(1)$. Finally for every MRF edge for which $E'_{u,v}(0,1) > 0$, we add a graphcuts edge 
$x_u \to x_v$ with cost $E'_{u,v}(0,1)$. (We don't need to add the edge in both directions.) 

One can show (Exercise 22.1) that the min cut in this graph is the same as the minimum 
energy configuration. Thus we can use max flow/min cut to find the globally optimal MAP 
estimate (Greig et al. 1989). 
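The reparameterization behind this construction can be checked directly. The sketch below, with a made-up energy table and hypothetical function names, splits a pairwise energy into a constant, two unary increments on state 1, and a single non-negative pairwise weight on the configuration $(x_u, x_v) = (0, 1)$, and then verifies that the pieces reproduce all four original entries. It assumes the unary updates are read as increments to the state-1 energies of $u$ and $v$, which is the reading under which the identity holds.

```python
def is_submodular(E):
    """Check E(1,1) + E(0,0) <= E(1,0) + E(0,1) for a 2x2 table E[a][b]."""
    return E[1][1] + E[0][0] <= E[1][0] + E[0][1]


def decompose(E):
    """Return (const, du, dv, w01) with
    E[a][b] == const + du * a + dv * b + w01 * (1 - a) * b."""
    w01 = E[1][0] + E[0][1] - E[0][0] - E[1][1]  # pairwise weight on (0, 1)
    du = E[1][0] - E[0][0]  # increment to the state-1 energy of u
    dv = E[1][1] - E[1][0]  # increment to the state-1 energy of v
    return E[0][0], du, dv, w01


E = [[1.0, 4.0], [3.0, 2.0]]  # hypothetical submodular pairwise energy
const, du, dv, w01 = decompose(E)
rebuilt = [[const + du * a + dv * b + w01 * (1 - a) * b for b in (0, 1)]
           for a in (0, 1)]
print(is_submodular(E), w01, rebuilt)
```

The constant term does not affect the minimizer, so it can be dropped when building the graph.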


Graphcuts for nonbinary metric MRFs 


We now discuss how to use graphcuts for approximate MAP estimation in MRFs where each 
node can have multiple states (Boykov et al. 2001). However, we require that the pairwise energies 
form a metric. We call such a model a metric MRF. For example, suppose the states have a 
natural ordering, as commonly arises if they are a discretization of an underlying continuous 
space. In this case, we can define a metric of the form $E(x_s, x_t) = \min(\delta, ||x_s - x_t||)$ or a 
semi-metric of the form $E(x_s, x_t) = \min(\delta, (x_s - x_t)^2)$, for some constant $\delta > 0$. This energy 
encourages neighbors to have similar labels, but never "punishes" them by more than $\delta$. This $\delta$ 
term prevents over-smoothing, which we illustrate in Figure 19.20. 

One version of graphcuts is the alpha expansion. At each step, it picks one of the available 
labels or states and calls it $\alpha$; then it solves a binary subproblem where each variable can choose 
to remain in its current state, or to become state $\alpha$ (see Figure 22.15(d) for an illustration). More 
precisely, we define a new MRF on binary nodes, and we define the energies of this new model, 
relative to the current assignment $x$, as follows: 

\begin{align}
& E'_u(0) = E_u(x_u), \quad E'_u(1) = E_u(\alpha), \quad E'_{u,v}(0,0) = E_{u,v}(x_u, x_v) \tag{22.159} \\
& E'_{u,v}(0,1) = E_{u,v}(x_u, \alpha), \quad E'_{u,v}(1,0) = E_{u,v}(\alpha, x_v), \quad E'_{u,v}(1,1) = E_{u,v}(\alpha, \alpha) \tag{22.160}
\end{align}

To optimize $E'$ using graph cuts (and thus figure out the optimal alpha expansion move), 
we require that the energies be submodular. Plugging in the definition we get the following 
constraint: 

$$E_{u,v}(x_u, x_v) + E_{u,v}(\alpha, \alpha) \le E_{u,v}(x_u, \alpha) + E_{u,v}(\alpha, x_v) \tag{22.161}$$

For any distance function, $E_{u,v}(\alpha, \alpha) = 0$, and the remaining inequality follows from the 
triangle inequality. Thus we can apply the alpha expansion move to any metric MRF. 
Figure 22.15 (a) An image with 3 labels (initial labeling). (b) A standard local move (e.g., by iterative conditional modes) 
just flips the label of one pixel. (c) An $\alpha$-$\beta$ swap allows all nodes that are currently labeled as $\alpha$ to 
be relabeled as $\beta$ if this decreases the energy. (d) An $\alpha$ expansion allows all nodes that are not currently 
labeled as $\alpha$ to be relabeled as $\alpha$ if this decreases the energy. Source: Figure 2 of (Boykov et al. 2001). 
Used with kind permission of Ramin Zabih. 


At each step of alpha expansion, we find the optimal move from amongst an exponentially 
large set; thus we reach a strong local optimum, of much lower energy than the local optima 
found by standard greedy label flipping methods such as iterative conditional modes. In fact, 
one can show that, once the algorithm has converged, the energy of the resulting solution is at 
most 2c times the optimal energy, where 


$$c = \max_{(u,v) \in E} \frac{\max_{\alpha \ne \beta} E_{uv}(\alpha, \beta)}{\min_{\alpha \ne \beta} E_{uv}(\alpha, \beta)} \tag{22.162}$$


See Exercise 22.3 for the proof. In the case of the Potts model, c = 1, so we have a 2- 
approximation. 
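The structure of the algorithm can be sketched on a toy problem. In the sketch below the binary subproblem is solved by brute-force enumeration rather than by a graph cut, purely to keep the example short; this is only feasible for a handful of nodes. The chain, the observations, the truncated linear metric and all function names are hypothetical. The final energy is compared against the exact optimum and the $2c$ bound.

```python
from itertools import product


def energy(x, unary, edges, pair):
    return (sum(unary[u][x[u]] for u in range(len(x)))
            + sum(pair(x[u], x[v]) for u, v in edges))


def best_expansion_move(x, alpha, unary, edges, pair):
    """Best move where each node keeps its label or switches to alpha.
    Enumeration stands in for the min cut on the binary subproblem."""
    best, best_e = list(x), energy(x, unary, edges, pair)
    for bits in product((0, 1), repeat=len(x)):
        cand = [alpha if b else xi for b, xi in zip(bits, x)]
        e = energy(cand, unary, edges, pair)
        if e < best_e:
            best, best_e = cand, e
    return best, best_e


def alpha_expansion(x, labels, unary, edges, pair):
    x, e = list(x), energy(x, unary, edges, pair)
    improved = True
    while improved:  # stop when no expansion move lowers the energy
        improved = False
        for alpha in labels:
            cand, ce = best_expansion_move(x, alpha, unary, edges, pair)
            if ce < e - 1e-12:
                x, e, improved = cand, ce, True
    return x, e


labels = (0, 1, 2)
obs = [0, 0, 2, 2, 1]  # hypothetical noisy observations on a 5-node chain
unary = [[(k - o) ** 2 for k in labels] for o in obs]
edges = [(i, i + 1) for i in range(len(obs) - 1)]


def pair(a, b):
    return min(2, abs(a - b))  # truncated linear distance, a metric


x0 = [0] * len(obs)
e_init = energy(x0, unary, edges, pair)
x_hat, e_final = alpha_expansion(x0, labels, unary, edges, pair)
e_opt = min(energy(list(x), unary, edges, pair)
            for x in product(labels, repeat=len(obs)))
c = 2.0 / 1.0  # max over min of the off-diagonal pairwise energies
print(x_hat, e_final, e_opt, 2 * c * e_opt)
```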

Another version of graphcuts is the alpha-beta swap. At each step, two labels are chosen, 
call them $\alpha$ and $\beta$. All the nodes currently labeled $\alpha$ can change to $\beta$ (and vice versa) if this 
reduces the energy (see Figure 22.15(c) for an illustration). The resulting binary subproblem can 
be solved exactly, even if the energies are only semi-metric (that is, the triangle inequality need 
not hold; see Exercise 22.2). Although the $\alpha$-$\beta$ swap version can be applied to a broader class 
of models than the $\alpha$-expansion version, it is theoretically not as powerful. Indeed, in various 
low-level vision problems, (Szeliski et al. 2008) show empirically that the expansion version is 
usually better than the swap version (see Section 22.6.4). 


Experimental comparison of graphcuts and BP 


In Section 19.6.2.7, we described lattice-structured CRFs for various low-level vision problems. 
(Szeliski et al. 2008) performed an extensive comparison of different approximate optimization 
techniques for this class of problems. Some of the results, for the problem of stereo depth 
estimation, are shown in Figure 22.16. We see that the graphcut and tree-reweighted max- 
product BP (TRW) give the best results, with regular max-product BP being much worse. In terms 
of speed, graphcuts is the fastest, with TRW a close second. Other algorithms, such as ICM, 
simulated annealing or a standard domain-specific heuristic known as normalized correlation, are 
Figure 22.16 Energy minimization on a CRF for stereo depth estimation. Top row: two input images along 
with the ground truth depth values. Bottom row: energy vs time for 4 different optimization algorithms. 
Bottom left: results are for the Teddy image (shown in top row). Bottom right: results are for the Tsukuba 
image (shown in Figure 22.17(a)). Source: Figure 13.B.1 of (Koller and Friedman 2009). Used with kind 
permission of Daphne Koller. 


even worse, as shown qualitatively in Figure 22.17. 

Since TRW is optimizing the dual of the relaxed LP problem, we can use its value at conver- 
gence to evaluate the optimal energy. It turns out that for many of the images in the stereo 
benchmark dataset, the ground truth has higher energy (lower probability) than the globally op- 
timal estimate (Meltzer et al. 2005). This indicates that we are optimizing the wrong model. This 
is not surprising, since the pairwise CRF ignores known long-range constraints. Unfortunately, 
if we add these constraints to the model, the graph either becomes too dense (making BP slow), 
and/or the potentials become non-submodular (making graphcuts inapplicable). 

One way around this is to generate a diverse set of local modes, using repeated applications 
of graph cuts, as described in (Yadollahpour et al. 2011). We can then apply a more sophisticated 
model, which uses global features, to rerank the solutions. 
Figure 22.17 An example of stereo depth estimation using an MRF. (a) Left image, of size 384 x 288 
pixels, from the University of Tsukuba. (The corresponding right image is similar, but not shown.) (b) 
Ground truth depth map, quantized to 15 levels. (c-f): MAP estimates using different methods: (c) $\alpha$-$\beta$ 
swap, (d) $\alpha$ expansion, (e) normalized cross correlation, (f) simulated annealing. Source: Figure 10 of 
(Boykov et al. 2001). Used with kind permission of Ramin Zabih. 


Dual decomposition 
We are interested in computing 


$$p^* = \max_{x \in \mathcal{X}^m} \sum_{i \in V} \theta_i(x_i) + \sum_{f \in F} \theta_f(x_f) \tag{22.163}$$


where F represents a set of factors. We will assume that we can tractably optimize each local 
factor, but the combination of all of these factors makes the problem intractable. One way to 
proceed is to optimize each term independently, but then to introduce constraints that force all 
the local estimates of the variables’ values to agree with each other. We explain this in more 
detail below, following the presentation of (Sontag et al. 2011). 
Figure 22.18 (a) A pairwise MRF with 4 different edge factors. (b) We have 4 separate variables, plus a 
copy of each variable for each factor it participates in. Source: Figure 1.2-1.3 of (Sontag et al. 2011). Used 
with kind permission of David Sontag. 


22.6.5.1 Basic idea 


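Let us duplicate the variables $x_i$, once for each factor, and then force them to be equal. 
Specifically, let $x_f^f = \{x_i^f\}_{i \in f}$ be the set of variables used by factor $f$. This construction is 
illustrated in Figure 22.18. We can reformulate the objective as follows: 

$$p^* = \max_{x, x^f} \sum_{i \in V} \theta_i(x_i) + \sum_{f \in F} \theta_f(x_f^f) \quad \text{s.t. } x_i^f = x_i \;\; \forall f, i \in f \tag{22.164}$$

Let us now introduce Lagrange multipliers, or dual variables, $\delta_{fi}(k)$, to enforce these constraints. 
The Lagrangian becomes 

\begin{align}
L(\delta, x, x^f) &= \sum_{i \in V} \theta_i(x_i) + \sum_{f \in F} \theta_f(x_f^f) \tag{22.165} \\
&\quad + \sum_{f \in F} \sum_{i \in f} \sum_{\hat{x}_i} \delta_{fi}(\hat{x}_i) \left( \mathbb{I}(x_i = \hat{x}_i) - \mathbb{I}(x_i^f = \hat{x}_i) \right) \tag{22.166}
\end{align}

This is equivalent to our original problem in the following sense: for any value of $\delta$, we have 

$$p^* = \max_{x, x^f} L(\delta, x, x^f) \quad \text{s.t. } x_i^f = x_i \;\; \forall f, i \in f \tag{22.167}$$

since if the constraints hold, the last term is zero. We can get an upper bound by dropping the 
consistency constraints, and just optimizing the following upper bound: 

\begin{align}
L(\delta) &\triangleq \max_{x, x^f} L(\delta, x, x^f) \tag{22.168} \\
&= \sum_i \max_{x_i} \Big( \theta_i(x_i) + \sum_{f: i \in f} \delta_{fi}(x_i) \Big) + \sum_f \max_{x_f} \Big( \theta_f(x_f) - \sum_{i \in f} \delta_{fi}(x_i) \Big) \tag{22.169}
\end{align}

See Figure 22.19 for an illustration. 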


Figure 22.19 Illustration of dual decomposition. Source: Figure 1.2 of (Sontag et al. 2011). Used with 
kind permission of David Sontag. 


This objective is tractable to optimize, since each $x_f$ term is decoupled. Furthermore, we see 
that $L(\delta) \ge p^*$, since by relaxing the consistency constraints, we are optimizing over a larger 
space. Furthermore, we have the property that 

$$\min_\delta L(\delta) = p^* \tag{22.170}$$

so the upper bound is tight at the optimal value of $\delta$, which enforces the original constraints. 

Minimizing this upper bound is known as dual decomposition or Lagrangian relaxation 
(Komodakis et al. 2011; Sontag et al. 2011; Rush and Collins 2012). Furthermore, it can be shown 
that $L(\delta)$ is the dual to the same LP relaxation we saw before. We will discuss several possible 
optimization algorithms below. 

The main advantage of dual decomposition from a practical point of view is that it allows 
one to mix and match different kinds of optimization algorithms in a convenient way. For 
example, we can combine a grid structured graph with local submodular factors to perform 
image segmentation, together with a tree structured model to perform pose estimation (see 
Exercise 22.4). Analogous methods can be used in natural language processing, where we often 
have a mix of local and global constraints (see e.g., (Koo et al. 2010; Rush and Collins 2012)). 


22.6.5.2 Theoretical guarantees 


What can we say about the quality of the solutions obtained in this way? To understand this, let 
us first introduce some more notation: 


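\begin{align}
\bar{\theta}_i^{\delta}(x_i) &\triangleq \theta_i(x_i) + \sum_{f: i \in f} \delta_{fi}(x_i) \tag{22.171} \\
\bar{\theta}_f^{\delta}(x_f) &\triangleq \theta_f(x_f) - \sum_{i \in f} \delta_{fi}(x_i) \tag{22.172}
\end{align}

This represents a reparameterization of the original problem, in the sense that 

$$\sum_i \theta_i(x_i) + \sum_f \theta_f(x_f) = \sum_i \bar{\theta}_i^{\delta}(x_i) + \sum_f \bar{\theta}_f^{\delta}(x_f) \tag{22.173}$$

and hence 

$$L(\delta) = \sum_i \max_{x_i} \bar{\theta}_i^{\delta}(x_i) + \sum_f \max_{x_f} \bar{\theta}_f^{\delta}(x_f) \tag{22.174}$$

Now suppose there is a set of dual variables $\delta^*$ and an assignment $x^*$ such that the maximizing 
assignments to the singleton terms agree with the assignments to the factor terms, i.e., 
so that $x_i^* \in \arg\max_{x_i} \bar{\theta}_i^{\delta^*}(x_i)$ and $x_f^* \in \arg\max_{x_f} \bar{\theta}_f^{\delta^*}(x_f)$. In this case, we have 

$$L(\delta^*) = \sum_i \bar{\theta}_i^{\delta^*}(x_i^*) + \sum_f \bar{\theta}_f^{\delta^*}(x_f^*) = \sum_i \theta_i(x_i^*) + \sum_f \theta_f(x_f^*) \tag{22.175}$$

Now since 

$$\sum_i \theta_i(x_i^*) + \sum_f \theta_f(x_f^*) \le p^* \le L(\delta^*) \tag{22.176}$$

we conclude that $L(\delta^*) = p^*$, so $x^*$ is the MAP assignment. 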
So if we can find a solution where all the subproblems agree, we can be assured that it is the 
global optimum. This happens surprisingly often in practical problems. 


Subgradient descent 


$L(\delta)$ is a convex and continuous objective, but it is non-differentiable at points $\delta$ where $\bar{\theta}_i^{\delta}(x_i)$ 
or $\bar{\theta}_f^{\delta}(x_f)$ have multiple optima. One approach is to use subgradient descent. This updates all 
the elements of $\delta$ at the same time, as follows: 

$$\delta_{fi}^{t+1}(x_i) = \delta_{fi}^t(x_i) - \alpha_t g_{fi}^t(x_i) \tag{22.177}$$

where $g^t$ is the subgradient of $L(\delta)$ at $\delta^t$. If the step sizes $\alpha_t$ are set appropriately (see 
Section 8.5.2.1), this method is guaranteed to converge to a global optimum of the dual. (See 
(Komodakis et al. 2011) for details.) 

One can sa that the gradient is euen by the following sparse vector. First let x? € 
Arr a fcr and xí € argmax,. OF (xy). Next let gri(x;) = 0 for all elements. Finally, 
if a! Æ xf (so factor f disagrees with the local term on how to set variable i), we set gp;(x?) = 

_ gt _ st 

+1 and grilat) = —1. This has the effect of decreasing g? (x$) and increasing g? (x1), 
bringing them closer to agreement. Similarly, the subgradient update will decrease the value of 
—ôt _ st 
oP (af, X,\;) and increasing the value of Oy (£, Xfi) 
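
To make the sparse update concrete, here is a minimal sketch of a single step of Equation 22.177. It assumes a hypothetical model with two binary variables and one pairwise factor; the function names, the potentials and the step size are illustrative choices, and ties in the argmax are broken by taking the first maximizer.

```python
from itertools import product

STATES = list(product((0, 1), repeat=2))

def dual(theta_i, theta_f, delta):
    """L(delta) for two binary variables and one pairwise factor (Eq. 22.174)."""
    unary = sum(max(theta_i[i][v] + delta[i][v] for v in (0, 1)) for i in range(2))
    return unary + max(theta_f[x] - delta[0][x[0]] - delta[1][x[1]] for x in STATES)

def subgradient_step(theta_i, theta_f, delta, alpha):
    """One update delta <- delta - alpha * g (Eq. 22.177) using the sparse g."""
    # x^f: maximizer of the reparameterized factor term.
    xf = max(STATES, key=lambda x: theta_f[x] - delta[0][x[0]] - delta[1][x[1]])
    new_delta = [row[:] for row in delta]
    for i in range(2):
        # x_i^s: maximizer of the reparameterized singleton term.
        xs = max((0, 1), key=lambda v: theta_i[i][v] + delta[i][v])
        if xf[i] != xs:                      # factor and local term disagree on x_i
            new_delta[i][xs] -= alpha        # g_fi(x_i^s) = +1
            new_delta[i][xf[i]] += alpha     # g_fi(x_i^f) = -1
    return new_delta

# Hypothetical toy potentials, chosen so that the subproblems disagree at delta = 0.
theta_i = [[1.0, 0.0], [0.0, 0.0]]
theta_f = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.5}
delta0 = [[0.0, 0.0], [0.0, 0.0]]
delta1 = subgradient_step(theta_i, theta_f, delta0, alpha=0.25)
```

On this example the step lowers the dual objective; in general a single subgradient step need not decrease it, which is why the step size schedule matters.
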

To compute the gradient, we need to be able to solve subproblems of the following form: 

$$\arg\max_{x_f} \bar{\theta}_f^{\delta^t}(x_f) = \arg\max_{x_f} \left[ \theta_f(x_f) - \sum_{i \in f} \delta_{fi}^{t}(x_i) \right] \qquad (22.178)$$

(In (Komodakis et al. 2011), these subproblems are called slaves, whereas $L(\delta)$ is called the
master.) Obviously if the scope of factor $f$ is small, this is simple. For example, if each factor is
pairwise, and each variable has $K$ states, the cost is just $K^2$. However, there are some kinds of
global factors that also support exact and efficient maximization, including the following:


• Graphical models with low tree width.


• Factors that correspond to bipartite graph matchings (see e.g., (Duchi et al. 2007)). This
is useful for data association problems, where we must match up a sensor reading with
an unknown source. We can find the maximal matching using the so-called Hungarian
algorithm in $O(|f|^3)$ time (see e.g., (Papadimitriou and Steiglitz 1982)).


• Supermodular functions. We discuss this case in more detail in Section 22.6.3.2.


• Cardinality constraints. For example, we might have a factor over a large set of binary
variables that enforces that a certain number of bits are turned on; this can be useful in
problems such as image segmentation. In particular, suppose $\theta_f(x_f) = 0$ if $\sum_{i \in f} x_i = L$
and $\theta_f(x_f) = -\infty$ otherwise. We can find the maximizing assignment in $O(|f| \log |f|)$
time as follows: first define $e_i = \delta_{fi}(1) - \delta_{fi}(0)$; now sort the $e_i$ in increasing order; finally
set $x_i = 1$ for the first $L$ values, and $x_i = 0$ for the rest (Tarlow et al. 2010).


• Factors which are constant for all but a small set $S$ of distinguished values of $x_f$. Then we
can optimize over the factor in $O(|S|)$ time (Rother et al. 2009).
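
The sorting procedure for the cardinality factor can be sketched as follows. This is an illustrative implementation under the assumption that the slave objective is the one in Equation 22.178, so that maximizing it subject to the cardinality constraint amounts to switching on the $L$ variables with the smallest $e_i$; the function name and the example numbers are hypothetical.

```python
def max_cardinality_factor(delta, L):
    """Maximize -sum_i delta[i][x_i] subject to sum_i x_i == L, with x_i in {0, 1}.

    delta[i] = (delta_fi(0), delta_fi(1)). Returns the maximizing 0/1 assignment.
    """
    n = len(delta)
    e = [d1 - d0 for (d0, d1) in delta]            # extra cost of switching x_i on
    order = sorted(range(n), key=lambda i: e[i])   # indices by increasing e_i
    on = set(order[:L])                            # switch on the L cheapest ones
    return [1 if i in on else 0 for i in range(n)]

# Example with four variables, two of which must be on.
delta = [(0.0, 2.0), (1.0, 0.0), (0.0, 0.5), (3.0, 0.0)]
x = max_cardinality_factor(delta, 2)
```

The cost is dominated by the sort, which matches the $O(|f| \log |f|)$ bound quoted above.
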


Coordinate descent 


An alternative to updating the entire $\delta$ vector at once (albeit sparsely) is to update it using block
coordinate descent. By choosing the size of the blocks, we can trade off convergence speed with
ease of the local optimization problem.

One approach, which optimizes $\delta_{fi}(x_i)$ for all $i \in f$ and all $x_i$ at the same time (for a
fixed factor $f$), is known as max product linear programming (Globerson and Jaakkola 2008).
Algorithmically, this is similar to belief propagation on a factor graph. In particular, we define
$\delta_{f \to i}$ as messages sent from factor $f$ to variable $i$, and we define $\delta_{i \to f}$ as messages sent from
variable $i$ to factor $f$. These messages can be computed as follows (see (Globerson and Jaakkola
2008) for the derivation):¹⁰

$$\delta_{i \to f}(x_i) = \theta_i(x_i) + \sum_{g \ne f} \delta_{g \to i}(x_i) \qquad (22.179)$$

$$\delta_{f \to i}(x_i) = -\delta_{i \to f}(x_i) + \frac{1}{|f|} \max_{x_{f \setminus i}} \left[ \theta_f(x_f) + \sum_{j \in f} \delta_{j \to f}(x_j) \right] \qquad (22.180)$$

We then set the dual variables $\delta_{fi}(x_i)$ to be the messages $\delta_{f \to i}(x_i)$.

10. Note that we denote their $\delta_i^{-f}(x_i)$ by $\delta_{i \to f}(x_i)$.

For example, consider a 2 × 2 grid MRF, with the following pairwise factors: $\theta_f(x_1, x_2)$,
$\theta_g(x_1, x_3)$, $\theta_h(x_2, x_4)$, and $\theta_k(x_3, x_4)$. The outgoing message from factor $f$ to variable 2 is a
function of all messages coming into $f$, and $f$’s local factor:

$$\delta_{f \to 2}(x_2) = -\delta_{2 \to f}(x_2) + \frac{1}{2} \max_{x_1} \left[ \theta_f(x_1, x_2) + \delta_{1 \to f}(x_1) + \delta_{2 \to f}(x_2) \right] \qquad (22.181)$$

Similarly, the outgoing message from variable 2 to factor $f$ is a function of all the messages
sent into variable 2 from other connected factors (in this example, just factor $h$) and the local
potential:

$$\delta_{2 \to f}(x_2) = \theta_2(x_2) + \delta_{h \to 2}(x_2) \qquad (22.182)$$
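
The two message computations for the grid example can be written out directly. The sketch below is illustrative only: the function names, the table layout `theta_f[x1][x2]` and the numerical values are assumptions, and only the pairwise case $|f| = 2$ of Equation 22.180 is handled.

```python
def msg_var_to_factor(theta_2, msg_h_to_2):
    """delta_{2->f}(x2) = theta_2(x2) + delta_{h->2}(x2), as in Eq. 22.182."""
    return [t + m for t, m in zip(theta_2, msg_h_to_2)]

def msg_factor_to_var(theta_f, msg_1_to_f, msg_2_to_f):
    """delta_{f->2}(x2) as in Eq. 22.181; theta_f[x1][x2] is a K x K table."""
    K = len(msg_2_to_f)
    out = []
    for x2 in range(K):
        # Max-marginal of the factor plus incoming messages, maxing out x1.
        best = max(theta_f[x1][x2] + msg_1_to_f[x1] + msg_2_to_f[x2]
                   for x1 in range(K))
        out.append(-msg_2_to_f[x2] + 0.5 * best)
    return out

# Made-up binary example.
d_2_to_f = msg_var_to_factor([0.0, 1.0], [0.5, -0.5])
d_f_to_2 = msg_factor_to_var([[1.0, 0.0], [0.0, 2.0]], [0.0, 1.0], d_2_to_f)
```
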


The key computational bottleneck is computing the max marginals of each factor, where we
max out all the variables from $x_f$ except for $x_i$, i.e., we need to be able to compute the following
max marginals efficiently:

$$\max_{x_{f \setminus i}} h(x_{f \setminus i}, x_i), \qquad h(x_{f \setminus i}, x_i) = \theta_f(x_f) + \sum_{j \in f} \delta_{j \to f}(x_j) \qquad (22.183)$$

The difference from Equation 22.178 is that we are maxing over all but one of the variables. We 
can solve this efficiently for low treewidth graphical models using message passing; we can also 
solve this efficiently for factors corresponding to bipartite matchings (Duchi et al. 2007) or to 
cardinality constraints (Tarlow et al. 2010). However, there are cases where maximizing over all 
the variables in a factor’s scope is computationally easier than maximizing over all-but-one (see 
(Sontag et al. 2011, Sec 1.5.4) for an example); in such cases, we may prefer to use a subgradient 
method. 

Coordinate descent is a simple algorithm that is often much faster at minimizing the dual than 
gradient descent, especially in the early iterations. It also reduces the objective monotonically, 
and does not need any step size parameters. Unfortunately, it is not guaranteed to converge to 
the global optimum, since $L(\delta)$ is convex but not strictly convex (which implies there may be
more than one globally optimizing value). One way to ensure convergence is to replace the max
function in the definition of $L(\delta)$ with the soft-max function, which makes the objective strictly
convex (see e.g., (Hazan and Shashua 2010) for details). 


Recovering the MAP assignment 


So far, we have been focussing on finding the optimal value of $\delta^*$. But what we really want is
the optimal value of $x^*$. In general, computing $x^*$ from $\delta^*$ is NP-hard, even if the LP relaxation
is tight and the MAP assignment is unique (Sontag et al. 2011, Theorem 1.4). (The troublesome
cases arise when there are fractional assignments with the same optimal value as the MAP
estimate.)

However, suppose that each $\bar{\theta}_i^{\delta^*}$ has a unique maximum, $x_i^*$; in this case, we say that $\delta^*$ is
locally decodable to $x^*$. One can show that in this case, the solution to the LP relaxation is
unique and is indeed $x^*$. If many, but not all, of the nodes are uniquely decodable, we can “clamp”
the uniquely decodable ones to their MAP value, and then use exact inference algorithms to
figure out the optimal assignment to the remaining variables. Using this method, (Meltzer et al.
2005) was able to optimally solve various stereo vision CRF estimation problems, and (Yanover
et al. 2007) was able to optimally solve various protein side-chain structure prediction problems.

Another approach is to use the upper bound provided by the dual in a branch and bound 
search procedure (Geoffrion 1974).


Exercises


Exercise 22.1 Graphcuts for MAP estimation in binary submodular MRFs 


(Source: Ex. 13.14 of (Koller and Friedman 2009).) Show that using the graph construction described in
Section 22.6.3.2, the cost of the cut is equal to the energy of the corresponding assignment, up to an 
irrelevant constant. (Warning: this exercise involves a lot of algebraic book-keeping.) 


Exercise 22.2 Graphcuts for alpha-beta swap 


(Source: Ex. 13.15 of (Koller and Friedman 2009).) Show how the optimal alpha-beta swap can be found by
running min-cut on an appropriately constructed graph. More precisely,


a. Define a set of binary variables $t_1, \ldots, t_n$ such that $t_i = 0$ means $x_i = \alpha$, $t_i = 1$ means $x_i = \beta$, and
$x_i$ is unchanged if $x_i \ne \alpha$ and $x_i \ne \beta$.


b. Define an energy function over the new variables such that $E'(t) = E(x) + \text{const}$.


c. Show that $E'$ is submodular if $E$ is a semimetric.


Exercise 22.3 Constant factor optimality for alpha-expansion 


(Source: Daphne Koller.) Let $\mathcal{X}$ be a pairwise metric Markov random field over a graph $G = (\mathcal{V}, \mathcal{E})$.
Suppose that the variables are nonbinary and that the node potentials are nonnegative. Let $\mathcal{A}$ denote the
set of labels for each $X \in \mathcal{X}$. Though it is not possible to (tractably) find the globally optimal assignment
$x^*$ in general, the $\alpha$-expansion algorithm provides a method for finding assignments $\hat{x}$ that are locally
optimal with respect to a large set of transformations, i.e., the possible $\alpha$-expansion moves.


Despite the fact that $\alpha$-expansion only produces a locally optimal MAP assignment, it is possible to prove
that the energy of this assignment is within a known factor of the energy of the globally optimal solution
$x^*$. In fact, this is a special case of a more general principle that applies to a wide variety of algorithms,
including max-product belief propagation and more general move-making algorithms: If one can prove 
that the solutions obtained by the algorithm are ‘strong local minima’, i.e., local minima with respect to 
a large set of potential moves, then it is possible to derive bounds on the (global) suboptimality of these 
solutions, and the quality of the bounds will depend on the nature of the moves considered. (There is a 
precise definition of ‘large set of moves’.) 


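Consider the following approach to proving the suboptimality bound for $\alpha$-expansion.


a. Let $\hat{x}$ be a local minimum with respect to expansion moves. For each $\alpha \in \mathcal{A}$, let
$\mathcal{V}^{\alpha} = \{s \in \mathcal{V} \mid x_s^* = \alpha\}$, i.e., the set of nodes labelled $\alpha$ in the global minimum. Let $x'$ be an
assignment that is equal to $x^*$ on $\mathcal{V}^{\alpha}$ and equal to $\hat{x}$ elsewhere; this is an $\alpha$-expansion of $\hat{x}$.
Verify that $E(x^*) \le E(\hat{x}) \le E(x')$.


b. Building on the previous part, show that $E(\hat{x}) \le 2c E(x^*)$, where

$$c = \max_{(s,t) \in \mathcal{E}} \left( \frac{\max_{\alpha \ne \beta} \epsilon_{st}(\alpha, \beta)}{\min_{\alpha \ne \beta} \epsilon_{st}(\alpha, \beta)} \right)$$

and $E$ denotes the energy of an assignment.


Hint. Think about where $x'$ agrees with $\hat{x}$ and where it agrees with $x^*$.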


Exercise 22.4 Dual decomposition for pose segmentation 


(Source: Daphne Koller.) Two important problems in computer vision are that of parsing articulated objects
(e.g., the human body), called pose estimation, and segmenting the foreground and the background, called
segmentation. Intuitively, these two problems are linked, in that solving either one would be easier if the 
solution to the other were available. We consider solving these problems simultaneously using a joint 
model over human poses and foreground/background labels and then using dual decomposition for MAP 
inference in this model. 


We construct a two-level model, where the high level handles pose estimation and the low level handles 
pixel-level background segmentation. Let $G = (\mathcal{V}, \mathcal{E})$ be an undirected grid over the pixels. Each node
$i \in \mathcal{V}$ represents a pixel. Suppose we have one binary variable $x_i$ for each pixel, where $x_i = 1$ means that
pixel $i$ is in the foreground. Denote the full set of these variables by $x = (x_i)$.


In addition, suppose we have an undirected tree structure $T = (\mathcal{V}', \mathcal{E}')$ on the parts. For each body
part, we have a discrete set of candidate poses that the part can be in, where each pose is characterized
by parameters specifying its position and orientation. (These candidates are generated by a procedure
external to the algorithm described here.) Define $y_{jk}$ to be a binary variable indicating whether body part
$j \in \mathcal{V}'$ is in configuration $k$. Then the full set of part variables is given by $y = (y_{jk})$, with $j \in \mathcal{V}'$
and $k = 1, \ldots, K$, where $J$ is the total number of body parts and $K$ is the number of candidate poses
for each part. Note that in order to describe a valid configuration, $y$ must satisfy the constraint that
$\sum_{k=1}^{K} y_{jk} = 1$ for each $j$.


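Suppose we have the following energy function on pixels:


$$E_1(x) = \sum_{i \in \mathcal{V}} 1[x_i = 1] \cdot \theta_i + \sum_{(i,j) \in \mathcal{E}} 1[x_i \ne x_j] \cdot \theta_{ij}.$$


Assume that the $\theta_{ij}$ arises from a metric (e.g., based on differences in pixel intensities), so this can be
viewed as the energy for a pairwise metric MRF with respect to $G$.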


We then have the following energy function for parts: 


$$E_2(y) = \sum_{p \in \mathcal{V}'} \theta_p(y_p) + \sum_{(p,q) \in \mathcal{E}'} \theta_{pq}(y_p, y_q).$$


Since each part candidate $y_{jk}$ is assumed to come with a position and orientation, we can compute a
binary mask in the image plane. The mask assigns a value to each pixel, denoted by $\{w_{jk}^{i}\}_{i \in \mathcal{V}}$, where
$w_{jk}^{i} = 1$ if pixel $i$ lies on the skeleton and decreases as we move away. We can use this to define an
energy function relating the parts and the pixels:


$$E_3(x, y) = \sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{V}'} \sum_{k=1}^{K} 1[x_i = 0, y_{jk} = 1] \cdot w_{jk}^{i}.$$


In other words, this energy term only penalizes the case where a part candidate is active but the pixel 
underneath is labeled as background. 


Formulate the minimization of $E_1 + E_2 + E_3$ as an integer program and show how you can use dual
decomposition to solve the dual of this integer program. Your solution should describe the decomposition 
into slaves, the method for solving each one, and the update rules for the overall algorithm. Briefly justify 
your design choices, particularly your choice of inference algorithms for the slaves. 


Monte Carlo inference 


Introduction 


So far, we discussed various deterministic algorithms for posterior inference. These meth- 
ods enjoy many of the benefits of the Bayesian approach, while still being about as fast as 
optimization-based point-estimation methods. The trouble with these methods is that they can 
be rather complicated to derive, and they are somewhat limited in their domain of applicabil- 
ity (e.g., they usually assume conjugate priors and exponential family likelihoods, although see 
(Wand et al. 2011) for some recent extensions of mean field to more complex distributions). Fur- 
thermore, although they are fast, their accuracy is often limited by the form of the approximation 
which we choose. 

In this chapter, we discuss an alternative class of algorithms based on the idea of Monte Carlo 
approximation, which we first introduced in Section 2.7. The idea is very simple: generate some 
(unweighted) samples from the posterior, $x^s \sim p(x|\mathcal{D})$, and then use these to compute any
quantity of interest, such as a posterior marginal, $p(x_1|\mathcal{D})$, or the posterior of the difference of
two quantities, $p(x_1 - x_2|\mathcal{D})$, or the posterior predictive, $p(y|\mathcal{D})$, etc. All of these quantities
can be approximated by $\mathbb{E}[f|\mathcal{D}] \approx \frac{1}{S} \sum_{s=1}^{S} f(x^s)$ for some suitable function $f$.
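
As a small illustration of the Monte Carlo average, the sketch below estimates the mean of a difference $x_1 - x_2$ and the probability that it is positive. The "posterior" here is a made-up stand-in (two independent Gaussians), and the helper names are hypothetical.

```python
import random

def mc_expectation(f, sampler, S, rng):
    """Approximate E[f(x)] by (1/S) * sum_s f(x^s), where x^s = sampler(rng)."""
    return sum(f(sampler(rng)) for _ in range(S)) / S

# Stand-in "posterior" (an assumption): x1 ~ N(1, 1), x2 ~ N(0, 1), independent.
def draw(rng):
    return (rng.gauss(1.0, 1.0), rng.gauss(0.0, 1.0))

rng = random.Random(0)
mean_diff = mc_expectation(lambda x: x[0] - x[1], draw, 20000, rng)
prob_pos = mc_expectation(lambda x: float(x[0] > x[1]), draw, 20000, rng)
```

For this stand-in the exact answers are 1 and roughly 0.76, so the two estimates can be compared against known values.
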

By generating enough samples, we can achieve any desired level of accuracy. The main
issue is: how do we efficiently generate samples from a probability distribution, particularly in 
high dimensions? In this chapter, we discuss non-iterative methods for generating independent 
samples. In the next chapter, we discuss an iterative method known as Markov Chain Monte 
Carlo, or MCMC for short, which produces dependent samples but which works well in high 
dimensions. Note that sampling is a large topic. The reader should consult other books, such as 
(Liu 2001; Robert and Casella 2004), for more information. 


Sampling from standard distributions 


We briefly discuss some ways to sample from 1 or 2 dimensional distributions of standard form. 
These methods are often used as subroutines by more complex methods.


Using the cdf


The simplest method for sampling from a univariate distribution is based on the inverse
probability transform. Let $F$ be a cdf of some distribution we want to sample from, and let $F^{-1}$
be its inverse. Then we have the following result.

Theorem 23.2.1. If $U \sim U(0,1)$ is a uniform rv, then $F^{-1}(U) \sim F$.

Proof.

$$\Pr(F^{-1}(U) \le x) = \Pr(U \le F(x)) \quad \text{(applying $F$ to both sides)} \qquad (23.1)$$

$$= F(x) \quad \text{(because $\Pr(U \le y) = y$)} \qquad (23.2)$$

where the first line follows since $F$ is a monotonic function, and the second line follows since
$U$ is uniform on the unit interval.

Figure 23.1 Sampling using an inverse CDF. Figure generated by sampleCdf.


Hence we can sample from any univariate distribution for which we can evaluate the inverse
cdf, as follows: generate a random number $u \sim U(0,1)$ using a pseudo random number
generator (see e.g., (Press et al. 1988) for details). Let $u$ represent the height up the $y$ axis. Then
“slide along” the $x$ axis until you intersect the $F$ curve, and then “drop down” and return the
corresponding $x$ value. This corresponds to computing $x = F^{-1}(u)$. See Figure 23.1 for an
illustration.

For example, consider the exponential distribution 


$$\text{Expon}(x|\lambda) \triangleq \lambda e^{-\lambda x} \, \mathbb{I}(x \ge 0) \qquad (23.3)$$

The cdf is

$$F(x) = \left(1 - e^{-\lambda x}\right) \mathbb{I}(x \ge 0) \qquad (23.4)$$

whose inverse is the quantile function

$$F^{-1}(p) = -\frac{\ln(1 - p)}{\lambda} \qquad (23.5)$$

By the above theorem, if $U \sim \text{Unif}(0, 1)$, we know that $F^{-1}(U) \sim \text{Expon}(\lambda)$. Furthermore,
since $1 - U \sim \text{Unif}(0,1)$ as well, we can sample from the exponential distribution by first
sampling from the uniform and then transforming the results using $-\ln(u)/\lambda$.
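
The exponential case is short enough to write out in full. The following is a minimal sketch using Python's standard `random` module; the function name is an assumption, and the uniform draw is shifted to lie in $(0, 1]$ so that the logarithm is always defined.

```python
import math
import random

def sample_exponential(lam, rng):
    """Draw from Expon(lam) via the inverse cdf, x = -ln(u) / lam."""
    u = 1.0 - rng.random()      # rng.random() lies in [0, 1); shift to (0, 1]
    return -math.log(u) / lam

rng = random.Random(0)
xs = [sample_exponential(2.0, rng) for _ in range(50000)]
sample_mean = sum(xs) / len(xs)     # should be close to 1 / lam = 0.5
```
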
Sampling from a Gaussian (Box-Muller method)


We now describe a method to sample from a Gaussian. The idea is we sample uniformly from
a unit radius circle, and then use the change of variables formula to derive samples from a
spherical 2d Gaussian. This can be thought of as two samples from a 1d Gaussian.

In more detail, sample $z_1, z_2 \in (-1, 1)$ uniformly, and then discard pairs that do not satisfy
$z_1^2 + z_2^2 \le 1$. The result will be points uniformly distributed inside the unit circle, so
$p(z) = \frac{1}{\pi} \mathbb{I}(z \text{ inside circle})$. Now define

$$x_i = z_i \left( \frac{-2 \ln r^2}{r^2} \right)^{1/2} \qquad (23.6)$$

for $i = 1:2$, where $r^2 = z_1^2 + z_2^2$. Using the multivariate change of variables formula, we have

$$p(x_1, x_2) = p(z_1, z_2) \left| \frac{\partial(z_1, z_2)}{\partial(x_1, x_2)} \right| = \left[ \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} x_1^2\right) \right] \left[ \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} x_2^2\right) \right] \qquad (23.7)$$

Hence $x_1$ and $x_2$ are two independent samples from a univariate Gaussian. This is known as
the Box-Muller method.
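
A direct transcription of the procedure above is sketched below; the function name is hypothetical. This rejection-based variant is often referred to as the polar form of the method.

```python
import math
import random

def box_muller_pair(rng):
    """Return two independent N(0, 1) draws using Eq. 23.6."""
    while True:
        z1 = rng.uniform(-1.0, 1.0)
        z2 = rng.uniform(-1.0, 1.0)
        r2 = z1 * z1 + z2 * z2
        if 0.0 < r2 <= 1.0:          # keep only points inside the unit circle
            scale = math.sqrt(-2.0 * math.log(r2) / r2)
            return z1 * scale, z2 * scale

rng = random.Random(1)
vals = []
for _ in range(20000):
    vals.extend(box_muller_pair(rng))
```

The check `0.0 < r2` guards against taking the logarithm of zero; that case has probability zero in exact arithmetic.
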
To sample from a multivariate Gaussian, we first compute the Cholesky decomposition of its
covariance matrix, $\Sigma = L L^T$, where $L$ is lower triangular. Next we sample $x \sim \mathcal{N}(0, I)$ using
the Box-Muller method. Finally we set $y = L x + \mu$. This is valid since

$$\text{cov}[y] = L \, \text{cov}[x] \, L^T = L I L^T = \Sigma \qquad (23.8)$$
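
The Cholesky construction can be sketched in pure Python for small matrices. Two assumptions to note: the standard normal draws come from the standard library's `Random.gauss` rather than from a hand-written Box-Muller routine, and the helper names and the example covariance are made up for illustration.

```python
import math
import random

def cholesky(A):
    """Lower-triangular L with A = L L^T, for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def sample_mvn(mu, L, rng):
    """Return y = L x + mu with x ~ N(0, I)."""
    x = [rng.gauss(0.0, 1.0) for _ in mu]
    return [mu[i] + sum(L[i][k] * x[k] for k in range(i + 1))
            for i in range(len(mu))]

Sigma = [[4.0, 2.0], [2.0, 3.0]]
L = cholesky(Sigma)
rng = random.Random(0)
ys = [sample_mvn([1.0, -1.0], L, rng) for _ in range(20000)]
```
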


Rejection sampling 


When the inverse cdf method cannot be used, one simple alternative is to use rejection sam- 
pling, which we now explain. 


Basic idea 


In rejection sampling, we create a proposal distribution $q(x)$ which satisfies $M q(x) \ge \tilde{p}(x)$,
for some constant $M$, where $\tilde{p}(x)$ is an unnormalized version of $p(x)$ (i.e., $p(x) = \tilde{p}(x)/Z_p$
for some possibly unknown constant $Z_p$). The function $M q(x)$ provides an upper envelope for
$\tilde{p}$. We then sample $x \sim q(x)$, which corresponds to picking a random $x$ location, and then
we sample $u \sim U(0,1)$, which corresponds to picking a random height ($y$ location) under the
envelope. If $u > \frac{\tilde{p}(x)}{M q(x)}$, we reject the sample, otherwise we accept it. See Figure 23.2(a), where
the acceptance region is shown shaded, and the rejection region is the white region between
the shaded zone and the upper envelope.

Figure 23.2 (a) Schematic illustration of rejection sampling. Source: Figure 2 of (Andrieu et al. 2003).
Used with kind permission of Nando de Freitas. (b) Rejection sampling from a $\text{Ga}(\alpha = 5.7, \lambda = 2)$
distribution (solid blue) using a proposal of the form $M \text{Ga}(k, \lambda - 1)$ (dotted red), where $k = \lfloor 5.7 \rfloor = 5$.
The curves touch at $\alpha - k = 0.7$. Figure generated by rejectionSamplingDemo.
Legend: target p(x); comparison function Mq(x).

We now prove that this procedure is correct. Let

$$S = \left\{ (x, u) : u \le \frac{\tilde{p}(x)}{M q(x)} \right\}, \quad S_0 = \left\{ (x, u) : x \le x_0, \, u \le \frac{\tilde{p}(x)}{M q(x)} \right\} \qquad (23.9)$$


Then the cdf of the accepted points is given by 


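$$P(x \le x_0 \mid x \text{ accepted}) = \frac{P(x \le x_0, x \text{ accepted})}{P(x \text{ accepted})} \qquad (23.10)$$

$$= \frac{\int\!\!\int \mathbb{I}((x, u) \in S_0) \, q(x) \, du \, dx}{\int\!\!\int \mathbb{I}((x, u) \in S) \, q(x) \, du \, dx} = \frac{\int_{-\infty}^{x_0} \tilde{p}(x) \, dx}{\int_{-\infty}^{\infty} \tilde{p}(x) \, dx} \qquad (23.11)$$

which is the cdf of $p(x)$, as desired.

How efficient is this method? Since we generate with probability $q(x)$ and accept with
probability $\frac{\tilde{p}(x)}{M q(x)}$, the probability of acceptance is

$$p(\text{accept}) = \int \frac{\tilde{p}(x)}{M q(x)} q(x) \, dx = \frac{1}{M} \int \tilde{p}(x) \, dx \qquad (23.12)$$

Hence we want to choose $M$ as small as possible while still satisfying $M q(x) \ge \tilde{p}(x)$.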
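
The accept/reject loop and the acceptance probability of Equation 23.12 can be illustrated with a generic sampler. This is a sketch under stated assumptions: the function signature is hypothetical, and the target $\tilde{p}(x) = x(1 - x)$ on $[0, 1]$ with a uniform proposal and $M = 1/4$ is chosen only because its acceptance rate, $4 \cdot \frac{1}{6} = \frac{2}{3}$, is easy to verify by hand.

```python
import random

def rejection_sample(p_tilde, q_sample, q_pdf, M, n, rng):
    """Draw n samples from p(x) proportional to p_tilde(x).

    Assumes M * q_pdf(x) >= p_tilde(x) everywhere.
    Returns (samples, number_of_proposals).
    """
    samples, proposals = [], 0
    while len(samples) < n:
        x = q_sample(rng)                       # random x location
        u = rng.random()                        # random relative height
        proposals += 1
        if u <= p_tilde(x) / (M * q_pdf(x)):    # accept if under the target
            samples.append(x)
    return samples, proposals

rng = random.Random(0)
xs, tried = rejection_sample(lambda x: x * (1.0 - x),   # unnormalized Beta(2, 2)
                             lambda r: r.random(),      # proposal U(0, 1)
                             lambda x: 1.0,
                             0.25, 20000, rng)
accept_rate = len(xs) / tried
```
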


Example 


For example, suppose we want to sample from a Gamma distribution:¹

$$\text{Ga}(x|\alpha, \lambda) = \frac{1}{\Gamma(\alpha)} x^{\alpha - 1} \lambda^{\alpha} \exp(-\lambda x) \qquad (23.13)$$

One can show that if $X_i \overset{\text{iid}}{\sim} \text{Expon}(\lambda)$, and $Y = X_1 + \cdots + X_k$, then $Y \sim \text{Ga}(k, \lambda)$. For
non-integer shape parameters, we cannot use this trick. However, we can use rejection sampling
using a $\text{Ga}(k, \lambda - 1)$ distribution as a proposal, where $k = \lfloor \alpha \rfloor$. The ratio has the form

$$\frac{p(x)}{q(x)} = \frac{\text{Ga}(x|\alpha, \lambda)}{\text{Ga}(x|k, \lambda - 1)} = \frac{x^{\alpha-1} \lambda^{\alpha} \exp(-\lambda x)/\Gamma(\alpha)}{x^{k-1} (\lambda-1)^k \exp(-(\lambda-1)x)/\Gamma(k)} \qquad (23.14)$$

$$= \frac{\Gamma(k) \lambda^{\alpha}}{\Gamma(\alpha)(\lambda - 1)^k} x^{\alpha - k} \exp(-x) \qquad (23.15)$$

This ratio attains its maximum when $x = \alpha - k$. Hence

$$M = \frac{\text{Ga}(\alpha - k|\alpha, \lambda)}{\text{Ga}(\alpha - k|k, \lambda - 1)} \qquad (23.16)$$

See Figure 23.2(b) for a plot. (Exercise 23.2 asks you to devise a better proposal distribution
based on the Cauchy distribution.)

1. This section is based on notes by Ioana A. Cosma, available at http://users.aims.ac.za/~ioana/cp2.pdf.

Figure 23.3 (a) Idea behind adaptive rejection sampling. We place piecewise linear upper (and lower)
bounds on the log-concave density. Based on Figure 1 of (Gilks and Wild 1992). Figure generated by
arsEnvelope. (b-c) Using ARS to sample from a half-Gaussian. Figure generated by arsDemo, written by
Daniel Eaton.
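
The Gamma example can be sketched as follows. The code assumes $\lambda > 1$ and a non-integer shape $\alpha \ge 1$, draws the $\text{Ga}(k, \lambda - 1)$ proposal as a sum of $k$ exponentials, and uses the fact that the constant in Equation 23.15 cancels in $p(x)/(M q(x))$, leaving $(x/x^*)^{\alpha - k} \exp(-(x - x^*))$ with $x^* = \alpha - k$. The function name is hypothetical.

```python
import math
import random

def sample_gamma_rejection(alpha, lam, rng):
    """One draw from Ga(alpha, lam) with a Ga(k, lam - 1) proposal, k = floor(alpha).

    Assumes lam > 1 and a non-integer alpha >= 1.
    """
    k = math.floor(alpha)
    x_star = alpha - k                      # location where p/q is maximal
    while True:
        # Ga(k, lam - 1) as a sum of k Expon(lam - 1) draws (inverse cdf each).
        x = sum(-math.log(1.0 - rng.random()) / (lam - 1.0) for _ in range(k))
        # p(x) / (M q(x)) = (x / x_star)^(alpha - k) * exp(-(x - x_star))
        accept_prob = (x / x_star) ** (alpha - k) * math.exp(-(x - x_star))
        if rng.random() <= accept_prob:
            return x

rng = random.Random(0)
gs = [sample_gamma_rejection(5.7, 2.0, rng) for _ in range(5000)]
```

For $\alpha = 5.7$ and $\lambda = 2$ the target has mean $\alpha/\lambda = 2.85$ and variance $\alpha/\lambda^2 = 1.425$, which gives a simple sanity check on the samples.
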


Application to Bayesian statistics 


Suppose we want to draw (unweighted) samples from the posterior, $p(\theta|\mathcal{D}) = p(\mathcal{D}|\theta) p(\theta) / p(\mathcal{D})$.
We can use rejection sampling with $\tilde{p}(\theta) = p(\mathcal{D}|\theta) p(\theta)$ as the target distribution, $q(\theta) = p(\theta)$
as our proposal, and $M = p(\mathcal{D}|\hat{\theta})$, where $\hat{\theta} = \arg\max p(\mathcal{D}|\theta)$ is the MLE; this was first
suggested in (Smith and Gelfand 1992). We accept points with probability

$$\frac{\tilde{p}(\theta)}{M q(\theta)} = \frac{p(\mathcal{D}|\theta)}{p(\mathcal{D}|\hat{\theta})} \qquad (23.17)$$

Thus samples from the prior that have high likelihood are more likely to be retained in the
posterior. Of course, if there is a big mismatch between prior and posterior (which will be the
case if the prior is vague and the likelihood is informative), this procedure is very inefficient. We
discuss better algorithms later.
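
As an illustration of Equation 23.17, the sketch below samples the posterior of a Bernoulli parameter by proposing from the prior. The model is an assumption made for the example: a $U(0, 1)$ prior on $\theta$ and data summarized by counts `n1` (ones) and `n0` (zeros), for which the exact posterior is $\text{Beta}(n_1 + 1, n_0 + 1)$.

```python
import random

def posterior_by_rejection(n1, n0, n_samples, rng):
    """Sample theta | D for a Bernoulli likelihood with a U(0, 1) prior.

    D has n1 ones and n0 zeros; both counts are assumed to be positive.
    """
    def lik(theta):
        return theta ** n1 * (1.0 - theta) ** n0

    theta_hat = n1 / (n1 + n0)          # MLE
    M = lik(theta_hat)                  # M = p(D | theta_hat)
    out = []
    while len(out) < n_samples:
        theta = rng.random()            # draw from the prior
        if rng.random() <= lik(theta) / M:
            out.append(theta)
    return out

rng = random.Random(0)
thetas = posterior_by_rejection(7, 3, 5000, rng)
post_mean = sum(thetas) / len(thetas)   # exact value is 8 / 12
```
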


Adaptive rejection sampling 


We now describe a method that can automatically come up with a tight upper envelope $q(x)$
to any log concave density $p(x)$. The idea is to upper bound the log density with a piecewise
linear function, as illustrated in Figure 23.3(a). We choose the initial locations for the pieces
based on a fixed grid over the support of the distribution. We then evaluate the gradient of the
log density at these locations, and make the lines be tangent at these points.

Since the log of the envelope is piecewise linear, the envelope itself is piecewise exponential:

$$q(x) = M_i \lambda_i \exp(-\lambda_i (x - x_{i-1})), \quad x_{i-1} < x \le x_i \qquad (23.18)$$

where $x_i$ are the grid points. It is relatively straightforward to sample from this distribution. If
the sample $x$ is rejected, we create a new grid point at $x$, and thereby refine the envelope. As
the number of grid points is increased, the tightness of the envelope improves, and the rejection 
rate goes down. This is known as adaptive rejection sampling (ARS) (Gilks and Wild 1992). 
Figure 23.3(b-c) gives an example of the method in action. As with standard rejection sampling, 
it can be applied to unnormalized distributions. 


Rejection sampling in high dimensions 


It is clear that we want to make our proposal q(x) as close as possible to the target distribution 
p(x), while still being an upper bound. But this is quite hard to achieve, especially in high 
dimensions. To see this, consider sampling from $p(x) = \mathcal{N}(0, \sigma_p^2 I)$ using as a proposal
$q(x) = \mathcal{N}(0, \sigma_q^2 I)$. Obviously we must have $\sigma_q^2 \ge \sigma_p^2$ in order to be an upper bound. In $D$
dimensions, the optimum value is given by $M = (\sigma_q/\sigma_p)^D$. The acceptance rate is $1/M$ (since
both $p$ and $q$ are normalized), which decreases exponentially fast with dimension. For example,
if $\sigma_q$ exceeds $\sigma_p$ by just 1%, then in 1000 dimensions the acceptance ratio will be about 1/20,000.
This is a fundamental weakness of rejection sampling. 
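
The figure of about 1/20,000 follows directly from $M = (\sigma_q/\sigma_p)^D$ and can be reproduced in a couple of lines; the function name below is a hypothetical helper.

```python
def acceptance_rate(sigma_ratio, D):
    """1/M with M = (sigma_q / sigma_p)^D, for two normalized isotropic Gaussians."""
    return sigma_ratio ** (-D)

M = 1.01 ** 1000                     # roughly 2.1e4
rate = acceptance_rate(1.01, 1000)   # roughly one acceptance per 21,000 proposals
rate_10d = acceptance_rate(1.01, 10) # in 10 dimensions almost every proposal is kept
```
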

In Chapter 24, we will describe MCMC sampling, which is a more efficient way to sample 
from high dimensional distributions. Sometimes this uses (adaptive) rejection sampling as a 
subroutine, which is known as adaptive rejection Metropolis sampling (Gilks et al. 1995). 


Importance sampling 


We now describe a Monte Carlo method known as importance sampling for approximating 
integrals of the form 


$$I = \mathbb{E}[f] = \int f(x) p(x) \, dx \qquad (23.19)$$


Basic idea 


The idea is to draw samples $x$ in regions which have high probability, $p(x)$, but also where
$|f(x)|$ is large. The result can be super efficient, meaning it needs fewer samples than if we
were to sample from the exact distribution $p(x)$. The reason is that the samples are focussed
on the important parts of space. For example, suppose we want to estimate the probability of
a rare event. Define $f(x) = \mathbb{I}(x \in E)$, for some set $E$. Then it is better to sample from a
proposal of the form $q(x) \propto f(x) p(x)$ than to sample from $p(x)$ itself.

Importance sampling samples from any proposal, $q(x)$. It then uses these samples to estimate
the integral as follows:

$$\mathbb{E}[f] = \int f(x) \frac{p(x)}{q(x)} q(x) \, dx \approx \frac{1}{S} \sum_{s=1}^{S} w_s f(x^s) = \hat{I} \qquad (23.20)$$

where $w_s \triangleq \frac{p(x^s)}{q(x^s)}$ are the importance weights. Note that, unlike rejection sampling, we use all
the samples.

How should we choose the proposal? A natural criterion is to minimize the variance of the
estimate $\hat{I} = \frac{1}{S} \sum_s w_s f(x^s)$. Now

$$\text{var}_{q(x)}[f(x) w(x)] = \mathbb{E}_{q(x)}[f^2(x) w^2(x)] - I^2 \qquad (23.21)$$


Since the last term is independent of $q$, we can ignore it. By Jensen’s inequality, we have the
following lower bound:

$$\mathbb{E}_{q(x)}[f^2(x) w^2(x)] \ge \left( \mathbb{E}_{q(x)}[|f(x) w(x)|] \right)^2 = \left( \int |f(x)| p(x) \, dx \right)^2 \qquad (23.22)$$

The lower bound is obtained when we use the optimal importance distribution:

$$q^*(x) = \frac{|f(x)| p(x)}{\int |f(x')| p(x') \, dx'} \qquad (23.23)$$

When we don't have a particular target function $f(x)$ in mind, we often just try to make
$q(x)$ as close as possible to $p(x)$. In general, this is difficult, especially in high dimensions, but
it is possible to adapt the proposal distribution to improve the approximation. This is known as
adaptive importance sampling (Oh and Berger 1992).
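
To illustrate the rare-event case, the sketch below estimates $P(X > c)$ for $X \sim \mathcal{N}(0, 1)$ with the estimator of Equation 23.20. The proposal $q = \mathcal{N}(c, 1)$ is an assumption made for convenience; it is not the optimal $q^*$ of Equation 23.23, only a simple distribution that places about half of its mass on the event. The function name is hypothetical.

```python
import math
import random

def rare_event_is(c, S, rng):
    """Estimate P(X > c), X ~ N(0, 1), by importance sampling with q = N(c, 1)."""
    total = 0.0
    for _ in range(S):
        x = rng.gauss(c, 1.0)
        if x > c:                                   # f(x) = I(x in E)
            # w = p(x) / q(x) = exp(-x^2/2 + (x - c)^2/2) = exp(c^2/2 - c x)
            total += math.exp(0.5 * c * c - c * x)
    return total / S

rng = random.Random(0)
est = rare_event_is(4.0, 20000, rng)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))       # Gaussian tail probability
```

Sampling from $p$ directly would, on average, produce less than one sample in the event out of 20,000 draws, so a naive Monte Carlo estimate of this probability would be essentially useless at this sample size.
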


Handling unnormalized distributions 


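It is frequently the case that we can evaluate the unnormalized target distribution, $\tilde{p}(x)$, but not
its normalization constant, $Z_p$. We may also want to use an unnormalized proposal, $\tilde{q}(x)$, with
possibly unknown normalization constant $Z_q$. We can do this as follows. First we evaluate

$$\mathbb{E}[f] = \frac{Z_q}{Z_p} \int f(x) \frac{\tilde{p}(x)}{\tilde{q}(x)} q(x) \, dx \approx \frac{Z_q}{Z_p} \frac{1}{S} \sum_{s=1}^{S} \tilde{w}_s f(x^s) \qquad (23.24)$$

where $\tilde{w}_s \triangleq \frac{\tilde{p}(x^s)}{\tilde{q}(x^s)}$ is the unnormalized importance weight. We can use the same set of samples
to evaluate the ratio $Z_p/Z_q$ as follows:

$$\frac{Z_p}{Z_q} = \frac{1}{Z_q} \int \tilde{p}(x) \, dx = \int \frac{\tilde{p}(x)}{\tilde{q}(x)} q(x) \, dx \approx \frac{1}{S} \sum_{s=1}^{S} \tilde{w}_s \qquad (23.25)$$

Hence

$$\hat{I} = \frac{\frac{1}{S} \sum_s \tilde{w}_s f(x^s)}{\frac{1}{S} \sum_s \tilde{w}_s} = \sum_{s=1}^{S} w_s f(x^s) \qquad (23.26)$$

where

$$w_s \triangleq \frac{\tilde{w}_s}{\sum_{s'} \tilde{w}_{s'}} \qquad (23.27)$$

are the normalized importance weights. The resulting estimate is a ratio of two estimates, and
hence is biased. However, as $S \to \infty$, we have that $\hat{I} \to I$, under weak assumptions (see e.g.,
(Robert and Casella 2004) for details).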
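
The normalized estimator of Equation 23.26 needs only the unnormalized densities. A minimal sketch follows; the function signature and the example (a standard normal target and an $\mathcal{N}(0, 2^2)$ proposal, both written without their normalizing constants) are assumptions chosen so that the true answer, $\mathbb{E}[x^2] = 1$, is known.

```python
import math
import random

def self_normalized_is(f, p_tilde, q_tilde, q_sample, S, rng):
    """Estimate E_p[f] via Eq. 23.26 from unnormalized densities p_tilde, q_tilde."""
    xs = [q_sample(rng) for _ in range(S)]
    w_tilde = [p_tilde(x) / q_tilde(x) for x in xs]    # unnormalized weights
    total = sum(w_tilde)
    return sum(w * f(x) for w, x in zip(w_tilde, xs)) / total

rng = random.Random(0)
second_moment = self_normalized_is(
    lambda x: x * x,
    lambda x: math.exp(-0.5 * x * x),      # N(0, 1) without its constant
    lambda x: math.exp(-0.125 * x * x),    # N(0, 4) without its constant
    lambda r: r.gauss(0.0, 2.0),
    20000, rng)
```
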


Importance sampling for a DGM: likelihood weighting 


We now describe a way to use importance sampling to generate samples from a distribution 
which can be represented as a directed graphical model (Chapter 10). 

If we have no evidence, we can sample from the unconditional joint distribution of a DGM 
p(x) as follows: first sample the root nodes, then sample their children, then sample their 
children, etc. This is known as ancestral sampling. It works because, in a DAG, we can always 
topologically order the nodes so that parents precede children. (Note that there is no equivalent
easy method for sampling from an unconditional undirected graphical model.) 

Now suppose we have some evidence, so some nodes are “clamped” to observed values, and 
we want to sample from the posterior p(x|D). If all the variables are discrete, we can use the 
following simple procedure: perform ancestral sampling, but as soon as we sample a value that 
is inconsistent with an observed value, reject the whole sample and start again. This is known 
as logic sampling (Henrion 1988). 

Needless to say, logic sampling is very inefficient, and it cannot be applied when we have 
real-valued evidence. However, it can be modified as follows. Sample unobserved variables as 
before, conditional on their parents. But don’t sample observed variables; instead we just use 
their observed values. This is equivalent to using a proposal of the form 


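$$q(x) = \prod_{t \notin E} p(x_t | x_{\text{pa}(t)}) \prod_{t \in E} \delta_{x_t^*}(x_t) \qquad (23.28)$$

where $E$ is the set of observed nodes, and $x_t^*$ is the observed value for node $t$. We should
therefore give the overall sample an importance weight as follows:

$$w(x) = \frac{p(x)}{q(x)} = \prod_{t \notin E} \frac{p(x_t | x_{\text{pa}(t)})}{p(x_t | x_{\text{pa}(t)})} \prod_{t \in E} \frac{p(x_t | x_{\text{pa}(t)})}{1} = \prod_{t \in E} p(x_t | x_{\text{pa}(t)}) \qquad (23.29)$$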


This technique is known as likelihood weighting (Fung and Chang 1989; Shachter and Peot 
1989). 
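
Likelihood weighting is easy to demonstrate on a very small network. The sketch below assumes a hypothetical three-node DGM with binary nodes R and S (both roots) and a child W that is observed to be 1; all probabilities are made up. Unobserved nodes are sampled from their conditional distributions, and each sample is weighted by the probability of the evidence given its sampled parents, as in Equation 23.29.

```python
import random

P_R, P_S = 0.2, 0.1                                             # P(R=1), P(S=1)
P_W = {(0, 0): 0.05, (0, 1): 0.9, (1, 0): 0.8, (1, 1): 0.99}    # P(W=1 | R, S)

def likelihood_weighting(n_samples, rng):
    """Estimate P(R=1 | W=1) with weights w = P(W=1 | R, S)."""
    num = den = 0.0
    for _ in range(n_samples):
        r = int(rng.random() < P_R)      # sample the unobserved roots
        s = int(rng.random() < P_S)
        w = P_W[(r, s)]                  # weight: product over observed nodes
        num += w * r
        den += w
    return num / den                     # normalized importance estimate

rng = random.Random(0)
estimate = likelihood_weighting(20000, rng)

# Exact answer by enumeration, for comparison.
joint = {(r, s): (P_R if r else 1 - P_R) * (P_S if s else 1 - P_S) * P_W[(r, s)]
         for r in (0, 1) for s in (0, 1)}
exact = (joint[(1, 0)] + joint[(1, 1)]) / sum(joint.values())
```
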


Sampling importance resampling (SIR)


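We can draw unweighted samples from $p(x)$ by first using importance sampling (with proposal
$q$) to generate a distribution of the form

$$p(x) \approx \sum_{s=1}^{S} w_s \delta_{x^s}(x) \qquad (23.30)$$

where $w_s$ are the normalized importance weights. We then sample with replacement from
Equation 23.30, where the probability that we pick $x^s$ is $w_s$. Let this procedure induce a
distribution denoted by $\hat{p}$. To see that this is valid, note that

$$\hat{p}(x \le x_0) = \sum_s \mathbb{I}(x^s \le x_0) w_s = \frac{\sum_s \mathbb{I}(x^s \le x_0) \tilde{p}(x^s)/q(x^s)}{\sum_s \tilde{p}(x^s)/q(x^s)} \qquad (23.31)$$

$$\to \frac{\int \mathbb{I}(x \le x_0) \frac{\tilde{p}(x)}{q(x)} q(x) \, dx}{\int \frac{\tilde{p}(x)}{q(x)} q(x) \, dx} \qquad (23.32)$$

$$= \frac{\int \mathbb{I}(x \le x_0) \tilde{p}(x) \, dx}{\int \tilde{p}(x) \, dx} = \int \mathbb{I}(x \le x_0) p(x) \, dx = p(x \le x_0) \qquad (23.33)$$

This is known as sampling importance resampling (SIR) (Rubin 1998). The result is an
unweighted approximation of the form

$$p(x) \approx \frac{1}{S'} \sum_{s=1}^{S'} \delta_{x^s}(x) \qquad (23.34)$$

Note that we typically take $S' \ll S$.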
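
The two stages of SIR, weighting and then resampling with replacement, can be sketched with the standard library's `random.choices`, which normalizes the weights internally. The function name and the example target and proposal (a standard normal target and an $\mathcal{N}(0, 2^2)$ proposal, both unnormalized) are assumptions for illustration.

```python
import math
import random

def sir(p_tilde, q_tilde, q_sample, S, S_prime, rng):
    """Draw S_prime unweighted samples approximately from p proportional to p_tilde."""
    xs = [q_sample(rng) for _ in range(S)]
    w = [p_tilde(x) / q_tilde(x) for x in xs]       # unnormalized weights
    # choices() samples with replacement, with probability proportional to w.
    return rng.choices(xs, weights=w, k=S_prime)

rng = random.Random(0)
draws = sir(lambda x: math.exp(-0.5 * x * x),       # target N(0, 1), unnormalized
            lambda x: math.exp(-0.125 * x * x),     # proposal N(0, 4), unnormalized
            lambda r: r.gauss(0.0, 2.0),
            20000, 2000, rng)
```
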

This algorithm can be used to perform Bayesian inference in low-dimensional settings (Smith 
and Gelfand 1992). That is, suppose we want to draw (unweighted) samples from the posterior, 
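$p(\theta|\mathcal{D}) = p(\mathcal{D}|\theta) p(\theta)/p(\mathcal{D})$. We can use importance sampling with $\tilde{p}(\theta) = p(\mathcal{D}|\theta) p(\theta)$ as
the unnormalized posterior, and $q(\theta) = p(\theta)$ as our proposal. The normalized weights have the
form

$$w_s = \frac{\tilde{p}(\theta_s)/q(\theta_s)}{\sum_{s'} \tilde{p}(\theta_{s'})/q(\theta_{s'})} = \frac{p(\mathcal{D}|\theta_s)}{\sum_{s'} p(\mathcal{D}|\theta_{s'})} \qquad (23.35)$$

We can then use SIR to sample from $p(\theta|\mathcal{D})$.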

Of course, if there is a big discrepancy between our proposal (the prior) and the target (the
posterior), we will need a huge number of importance samples for this technique to work reliably,
since otherwise the variance of the importance weights will be very large, implying that most
samples carry no useful information. (This issue will come up again in Section 23.5, when we
discuss particle filtering.)


Particle filtering 


Particle filtering (PF) is a Monte Carlo, or simulation based, algorithm for recursive Bayesian 
inference. That is, it approximates the predict-update cycle described in Section 18.3.1. It is 
very widely used in many areas, including tracking, time-series forecasting, online parameter 
learning, etc. We explain the basic algorithm below. For a book-length treatment, see (Doucet 
et al. 2001); for a good tutorial, see (Arulampalam et al. 2002), or just read on.


Sequential importance sampling


The basic idea is to approximate the belief state (of the entire state trajectory) using a weighted
set of particles:

$$p(z_{1:t}|y_{1:t}) \approx \sum_{s=1}^{S} \hat{w}_t^s \delta_{z_{1:t}^s}(z_{1:t}) \qquad (23.36)$$

where $\hat{w}_t^s$ is the normalized weight of sample $s$ at time $t$. From this representation, we can
easily compute the marginal distribution over the most recent state, $p(z_t|y_{1:t})$, by simply
ignoring the previous parts of the trajectory, $z_{1:t-1}$. (The fact that PF samples in the space of
entire trajectories has various implications which we will discuss later.)

We update this belief state using importance sampling. If the proposal has the form 
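$q(z_{1:t}^s|y_{1:t})$, then the importance weights are given by

$$w_t^s \propto \frac{p(z_{1:t}^s|y_{1:t})}{q(z_{1:t}^s|y_{1:t})} \qquad (23.37)$$

which can be normalized as follows:

$$\hat{w}_t^s = \frac{w_t^s}{\sum_{s'} w_t^{s'}} \qquad (23.38)$$

We can rewrite the numerator recursively as follows:

$$p(z_{1:t}|y_{1:t}) = \frac{p(y_t|z_{1:t}, y_{1:t-1}) \, p(z_{1:t}|y_{1:t-1})}{p(y_t|y_{1:t-1})} \qquad (23.39)$$

$$= \frac{p(y_t|z_t) \, p(z_t|z_{1:t-1}, y_{1:t-1}) \, p(z_{1:t-1}|y_{1:t-1})}{p(y_t|y_{1:t-1})} \qquad (23.40)$$

$$\propto p(y_t|z_t) \, p(z_t|z_{t-1}) \, p(z_{1:t-1}|y_{1:t-1}) \qquad (23.41)$$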


where we have made the usual Markov assumptions. We will restrict attention to proposal
densities of the following form:

$$q(z_{1:t}|y_{1:t}) = q(z_t|z_{1:t-1}, y_{1:t})\, q(z_{1:t-1}|y_{1:t-1}) \tag{23.42}$$

so that we can “grow” the trajectory by adding the new state $z_t$ to the end. In this case, the
importance weights simplify to

$$w_t^s \propto \frac{p(y_t|z_t^s)\, p(z_t^s|z_{t-1}^s)\, p(z_{1:t-1}^s|y_{1:t-1})}{q(z_t^s|z_{1:t-1}^s, y_{1:t})\, q(z_{1:t-1}^s|y_{1:t-1})} \tag{23.43}$$

$$= w_{t-1}^s \, \frac{p(y_t|z_t^s)\, p(z_t^s|z_{t-1}^s)}{q(z_t^s|z_{1:t-1}^s, y_{1:t})} \tag{23.44}$$


If we further assume that $q(z_t|z_{1:t-1}, y_{1:t}) = q(z_t|z_{t-1}, y_t)$, then we only need to keep the
most recent part of the trajectory and observation sequence, rather than the whole history, in
order to compute the new sample. In this case, the weight becomes

$$w_t^s \propto w_{t-1}^s \, \frac{p(y_t|z_t^s)\, p(z_t^s|z_{t-1}^s)}{q(z_t^s|z_{t-1}^s, y_t)} \tag{23.45}$$
Hence we can approximate the posterior filtered density using

$$p(z_t|y_{1:t}) \approx \sum_{s=1}^{S} \hat{w}_t^s \, \delta_{z_t^s}(z_t) \tag{23.46}$$

As $S \to \infty$, one can show that this approaches the true posterior (Crisan et al. 1999).

The basic algorithm is now very simple: for each old sample $s$, propose an extension using
$z_t^s \sim q(z_t|z_{t-1}^s, y_t)$, and give this new particle weight $w_t^s$ using Equation 23.45. Unfortunately,
this basic algorithm does not work very well, as we discuss below.


The degeneracy problem 


The basic sequential importance sampling algorithm fails after a few steps because most of 
the particles will have negligible weight. This is called the degeneracy problem, and occurs 
because we are sampling in a high-dimensional space (in fact, the space is growing in size over 
time), using a myopic proposal distribution. 

We can quantify the degree of degeneracy using the effective sample size, defined by

$$S_{\text{eff}} \triangleq \frac{S}{1 + \text{var}[w_t^{*s}]} \tag{23.47}$$

where $w_t^{*s} = p(z_t^s|y_{1:t})/q(z_t^s|z_{t-1}^s, y_t)$ is the “true weight” of particle $s$. This quantity cannot
be computed exactly, since we don’t know the true posterior, but we can approximate it using

$$\hat{S}_{\text{eff}} = \frac{1}{\sum_{s=1}^{S} (w_t^s)^2} \tag{23.48}$$


If the variance of the weights is large, then we are wasting our resources updating particles with 
low weight, which do not contribute much to our posterior estimate. 
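The approximation in Equation 23.48 is a one-liner in code. The sketch below is illustrative only; the function name and the three example weight vectors are our own choices.

```python
def effective_sample_size(weights):
    """Approximate effective sample size, 1 / sum_s (w_s)^2, for normalized weights."""
    total = sum(weights)
    normalized = [w / total for w in weights]  # normalize defensively
    return 1.0 / sum(w * w for w in normalized)


uniform = [0.25, 0.25, 0.25, 0.25]    # all particles equally useful
degenerate = [1.0, 0.0, 0.0, 0.0]     # one particle carries all the weight
skewed = [0.7, 0.1, 0.1, 0.1]

ess_uniform = effective_sample_size(uniform)
ess_degenerate = effective_sample_size(degenerate)
ess_skewed = effective_sample_size(skewed)
```

With uniform weights the estimate equals the number of particles (here 4), and with all the weight on one particle it equals 1; the skewed example gives 1/0.52, a little under 2.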

There are two main solutions to the degeneracy problem: adding a resampling step, and using 
a good proposal distribution. We discuss both of these in turn. 


The resampling step 


The main improvement to the basic SIS algorithm is to monitor the effective sample size,
and whenever it drops below a threshold, to eliminate particles with low weight, and then
to create replicates of the surviving particles. (Hence PF is sometimes called survival of the
fittest (Kanazawa et al. 1995).) In particular, we generate a new set $\{z_t^{s*}\}_{s=1}^{S}$ by sampling with
replacement $S$ times from the weighted distribution

$$p(z_t|y_{1:t}) \approx \sum_{s=1}^{S} \hat{w}_t^s \, \delta_{z_t^s}(z_t) \tag{23.49}$$

where the probability of choosing particle $j$ for replication is $w_t^j$. (This is sometimes called
rejuvenation.) The result is an iid unweighted sample from the discrete density Equation 23.49,
so we set the new weights to $w_t^s = 1/S$. This scheme is illustrated in Figure 23.4.
[Figure: particles representing P(z(t-1) | y(1:t-1)) pass through the proposal step to give P(z(t) | y(1:t-1)), then through the weighting and resample steps to give P(z(t) | y(1:t)).]

Figure 23.4 Illustration of particle filtering.


There are a variety of algorithms for performing the resampling step. The simplest is multinomial
resampling, which computes

$$(K_1, \ldots, K_S) \sim \text{Mu}(S, (w_t^1, \ldots, w_t^S)) \tag{23.50}$$

We then make $K_s$ copies of $z_t^s$. Various improvements exist, such as systematic resampling,
residual resampling, and stratified sampling, which can reduce the variance of the weights.
All these methods take $O(S)$ time. See (Doucet et al. 2001) for details.
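As a rough illustration, the following Python sketch implements multinomial resampling and one common form of systematic resampling. The function names are ours, and the code returns ancestor indices rather than the counts $K_s$ (the counts can be recovered by tallying the indices). Note that `random.choices` locates each draw by bisection, so this multinomial version costs O(S log S) rather than O(S).

```python
import random


def multinomial_resample(weights, rng):
    """Draw S ancestor indices i.i.d. from the categorical distribution given by weights."""
    n = len(weights)
    return rng.choices(range(n), weights=weights, k=n)


def systematic_resample(weights, rng):
    """Systematic resampling: one uniform offset, then S evenly spaced pointers into the CDF."""
    n = len(weights)
    offset = rng.random() / n          # single random draw in [0, 1/S)
    indices = []
    cumulative = weights[0]
    i = 0
    for s in range(n):
        pointer = offset + s / n
        # Advance to the particle whose CDF interval contains the pointer.
        while pointer >= cumulative and i < n - 1:
            i += 1
            cumulative += weights[i]
        indices.append(i)
    return indices


rng = random.Random(1)
w = [0.5, 0.25, 0.125, 0.125]
systematic_indices = systematic_resample(w, rng)
multinomial_indices = multinomial_resample(w, rng)
```

In the systematic scheme, particle $s$ always receives either the floor or the ceiling of $S w_t^s$ copies, so in this example the first particle is copied exactly twice and the second exactly once, whereas the multinomial counts fluctuate from run to run.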

The overall particle filtering algorithm is summarized in Algorithm 23.1. (Note that if an estimate
of the state is required, it should be computed before the resampling step, since this will result
in lower variance.)


Algorithm 23.1: One step of a generic particle filter
for s = 1 : S do
    Draw $z_t^s \sim q(z_t|z_{t-1}^s, y_t)$;
    Compute weight $w_t^s \propto w_{t-1}^s \frac{p(y_t|z_t^s)\, p(z_t^s|z_{t-1}^s)}{q(z_t^s|z_{t-1}^s, y_t)}$;
Normalize weights: $w_t^s = \frac{w_t^s}{\sum_{s'} w_t^{s'}}$;
Compute $\hat{S}_{\text{eff}} = \frac{1}{\sum_{s=1}^{S} (w_t^s)^2}$;
if $\hat{S}_{\text{eff}} < S_{\min}$ then
    Resample S indices $\pi \sim w_t$;
    $z_t^{:} = z_t^{\pi}$;
    $w_t^s = 1/S$;
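The algorithm above can be sketched in Python as follows. Everything model-specific here is an assumption made for illustration: a 1d random-walk state with Gaussian observation noise, the noise scales, S = 500 particles, a threshold of S/2, and the prior used as the proposal. The callables `propose` and `weight_update` are hypothetical names for the proposal sampler and for the ratio that multiplies the old weight.

```python
import math
import random


def particle_filter_step(particles, weights, y, propose, weight_update, s_min, rng):
    """One step of a generic particle filter for scalar states.

    propose(z_prev, y, rng) draws z_t from q(z_t | z_{t-1}, y_t).
    weight_update(z_new, z_prev, y) returns, up to a constant,
    p(y_t | z_t) p(z_t | z_{t-1}) / q(z_t | z_{t-1}, y_t).
    """
    new_particles = [propose(z, y, rng) for z in particles]
    new_weights = [w * weight_update(zn, zp, y)
                   for w, zn, zp in zip(weights, new_particles, particles)]
    total = sum(new_weights)
    new_weights = [w / total for w in new_weights]
    # Posterior mean, computed before resampling as the text recommends.
    estimate = sum(w * z for w, z in zip(new_weights, new_particles))
    ess = 1.0 / sum(w * w for w in new_weights)
    if ess < s_min:
        n = len(new_particles)
        idx = rng.choices(range(n), weights=new_weights, k=n)
        new_particles = [new_particles[i] for i in idx]
        new_weights = [1.0 / n] * n
    return new_particles, new_weights, estimate


Q_STD, R_STD = 1.0, 0.5  # assumed process and observation noise scales


def propose_from_prior(z_prev, y, rng):
    return z_prev + rng.gauss(0.0, Q_STD)


def likelihood(z_new, z_prev, y):
    # With the prior as proposal the transition terms cancel, leaving p(y_t | z_t)
    # (the Gaussian normalizing constant is dropped, since weights are renormalized).
    return math.exp(-0.5 * ((y - z_new) / R_STD) ** 2)


rng = random.Random(0)
S = 500
particles = [rng.gauss(0.0, 1.0) for _ in range(S)]
weights = [1.0 / S] * S
z_true, errors = 0.0, []
for t in range(50):
    z_true += rng.gauss(0.0, Q_STD)
    y = z_true + rng.gauss(0.0, R_STD)
    particles, weights, est = particle_filter_step(
        particles, weights, y, propose_from_prior, likelihood, S / 2, rng)
    errors.append(abs(est - z_true))
mean_abs_error = sum(errors) / len(errors)
```

Because this toy model is linear-Gaussian, a Kalman filter would give the exact answer; the point of the sketch is only to show the control flow of one step, in particular that resampling happens only when the estimated effective sample size falls below the threshold.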


Although the resampling step helps with the degeneracy problem, it introduces problems of 
its own. In particular, since the particles with high weight will be selected many times, there is 
a loss of diversity amongst the population. This is known as sample impoverishment. In the 
extreme case of no process noise (e.g., if we have static but unknown parameters as part of the
state space), then all the particles will collapse to a single point within a few iterations.

To mitigate this problem, several solutions have been proposed. (1) Only resample when
necessary, not at every time step. (The original bootstrap filter (Gordon 1993) resampled at
every step, but this is suboptimal.) (2) After replicating old particles, sample new values using
an MCMC step which leaves the posterior distribution invariant (see e.g., the resample-move
algorithm in (Gilks and Berzuini 2001)). (3) Create a kernel density estimate on top of the
particles,

$$p(z_t|y_{1:t}) \approx \sum_{s=1}^{S} w_t^s \, \kappa(z_t - z_t^s) \tag{23.51}$$

where $\kappa$ is some smoothing kernel. We then sample from this smoothed distribution. This is
known as a regularized particle filter (Musso et al. 2001). (4) When performing inference on
static parameters, add some artificial process noise. (If this is undesirable, other algorithms must
be used for online parameter estimation, e.g., (Andrieu et al. 2005)).


The proposal distribution 


The simplest and most widely used proposal distribution is to sample from the prior:

$$q(z_t|z_{t-1}^s, y_t) = p(z_t|z_{t-1}^s) \tag{23.52}$$

In this case, the weight update simplifies to

$$w_t^s \propto w_{t-1}^s \, p(y_t|z_t^s) \tag{23.53}$$

This can be thought of as a “generate and test” approach: we sample values from the dynamic
model, and then evaluate how good they are after we see the data (see Figure 23.4). This 
is the approach used in the condensation algorithm (which stands for “conditional density 
propagation”) used for visual tracking (Isard and Blake 1998). However, if the likelihood is 
narrower than the dynamical prior (meaning the sensor is more informative than the motion 
model, which is often the case), this is a very inefficient approach, since most particles will be 
assigned very low weight. 

It is much better to actually look at the data $y_t$ when generating a proposal. In fact, the
optimal proposal distribution has the following form:

$$q(z_t|z_{t-1}^s, y_t) = p(z_t|z_{t-1}^s, y_t) = \frac{p(y_t|z_t)\, p(z_t|z_{t-1}^s)}{p(y_t|z_{t-1}^s)} \tag{23.54}$$

If we use this proposal, the new weight is given by

$$w_t^s \propto w_{t-1}^s \, p(y_t|z_{t-1}^s) = w_{t-1}^s \int p(y_t|z')\, p(z'|z_{t-1}^s)\, dz' \tag{23.55}$$

This proposal is optimal since, for any given $z_{t-1}^s$, the new weight $w_t^s$ takes the same value
regardless of the value drawn for $z_t^s$. Hence, conditional on the old values $z_{t-1}^s$, the variance of
the true weights, $\text{var}[w_t^{*s}]$, is zero.
In general, it is intractable to sample from $p(z_t|z_{t-1}^s, y_t)$ and to evaluate the integral needed
to compute the predictive density $p(y_t|z_{t-1}^s)$. However, there are two cases when the optimal
proposal distribution can be used. The first setting is when $z_t$ is discrete, so the integral becomes
a sum. Of course, if the entire state space is discrete, we can use an HMM filter instead, but
in some cases, some parts of the state are discrete, and some continuous. The second setting
is when $p(z_t|z_{t-1}^s, y_t)$ is Gaussian. This occurs when the dynamics are nonlinear but the
observations are linear. See Exercise 23.3 for the details.
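For the discrete case, the optimal proposal and the corresponding weight factor reduce to a few lines. The sketch below is illustrative: the function name, the two-state transition matrix, and the likelihood values are all made-up inputs.

```python
def optimal_proposal_discrete(prev_state, trans, lik):
    """Optimal proposal q(z_t = j | z_{t-1} = i, y_t) for a discrete state z_t.

    trans[i][j] = p(z_t = j | z_{t-1} = i); lik[j] = p(y_t | z_t = j).
    Returns (proposal, predictive), where predictive = p(y_t | z_{t-1} = i)
    is the factor that multiplies the old weight under the optimal proposal.
    """
    unnorm = [lik[j] * trans[prev_state][j] for j in range(len(lik))]
    predictive = sum(unnorm)  # the integral over z_t becomes a sum
    proposal = [u / predictive for u in unnorm]
    return proposal, predictive


trans = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical transition matrix
lik = [0.05, 0.6]                 # hypothetical values of p(y_t | z_t = j)
proposal, predictive = optimal_proposal_discrete(0, trans, lik)
```

Here the particle was in state 0, which the dynamics favor staying in, but the observation strongly favors state 1; the optimal proposal balances the two and puts probability 0.06/0.105 (about 0.57) on state 1. The weight factor 0.105 does not depend on which state is then sampled.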

In cases where the model is not linear-Gaussian, we may still compute a Gaussian approximation
to $p(z_t|z_{t-1}^s, y_t)$ using the unscented transform (Section 18.5.2) and use this as a proposal.
This is known as the unscented particle filter (van der Merwe et al. 2000). In more general
settings, we can use other kinds of data-driven proposals, perhaps based on discriminative
models. Unlike MCMC, we do not need to worry about the proposals being reversible.


Application: robot localization 


Consider a mobile robot wandering around an office environment. We will assume that it already 
has a map of the world, represented in the form of an occupancy grid, which just specifies 
whether each grid cell is empty space or occupied by something solid like a wall. The goal
is for the robot to estimate its location. This can be solved optimally using an HMM filter, since 
we are assuming the state space is discrete. However, since the number of states, K, is often 
very large, the $O(K^2)$ time complexity per update is prohibitive. We can use a particle filter as
a sparse approximation to the belief state. This is known as Monte Carlo localization, and is 
described in detail in (Thrun et al. 2006). 

Figure 23.5 gives an example of the method in action. The robot uses a sonar range finder, 
so it can only sense distance to obstacles. It starts out with a uniform prior, reflecting the fact 
that the owner of the robot may have turned it on in an arbitrary location. (Figuring out where 
you are, starting from a uniform prior, is called global localization.) After the first scan, which 
indicates two walls on either side, the belief state is shown in (b). The posterior is still fairly 
broad, since the robot could be in any location where the walls are fairly close by, such as a 
corridor or any of the narrow rooms. After moving to location 2, the robot is pretty sure it must 
be in the corridor, as shown in (c). After moving to location 3, the sensor is able to detect the 
end of the corridor. However, due to symmetry, it is not sure if it is in location I (the true 
location) or location II. (This is an example of perceptual aliasing, which refers to the fact that 
different things may look the same.) After moving to locations 4 and 5, it is finally able to figure 
out precisely where it is. The whole process is analogous to someone getting lost in an office 
building, and wandering the corridors until they see a sign they recognize. 

In Section 23.6.3, we discuss how to estimate location and the map at the same time. 


Application: visual object tracking 


Our next example is concerned with tracking an object (in this case, a remote-controlled helicopter)
in a video sequence. The method uses a simple linear motion model for the centroid
of the object, and a color histogram for the likelihood model, using Bhattacharyya distance to
compare histograms. The proposal distribution is obtained by sampling from the likelihood. See
(Nummiaro et al. 2003) for further details.
[Figure 23.5 panels: (a) Path and reference poses. (b) Belief at reference pose 1. (c) Belief at reference pose 2. (d) Belief at reference pose 3. (e) Belief at reference pose 4. (f) Belief at reference pose 5.]

Figure 23.5 Illustration of Monte Carlo localization. Source: Figure 8.7 of (Thrun et al. 2006). Used
with kind permission of Sebastian Thrun.


Figure 23.6 shows some example frames. The system uses S = 250 particles, with an effective
sample size of $\hat{S}_{\text{eff}} = 134$. (a) shows the belief state at frame 1. The system has had to resample
5 times to keep the effective sample size above the threshold of 150; (b) shows the belief state
at frame 251; the red lines show the estimated location of the center of the object over the last
250 frames. (c) shows that the system can handle visual clutter, as long as it does not have the
same color as the target object. (d) shows that the system is confused between the grey of the
helicopter and the grey of the building. The posterior is bimodal. The green ellipse, representing
the posterior mean and covariance, is in between the two modes. (e) shows that the probability
mass has shifted to the wrong mode: the system has lost track. (f) shows the particles spread
out over the gray building; recovery of the object is very unlikely from this state using this
proposal.

Figure 23.6 Example of particle filtering applied to visual object tracking, based on color histograms.
(a-c) successful tracking: green ellipse is on top of the helicopter. (d-f): tracker gets distracted by gray clutter
in the background. See text for details. Figure generated by pfColorTrackerDemo, written by Sebastien
Paris.

We see that the method is able to keep track for a fairly long time, despite the presence 
of clutter. However, eventually it loses track of the object. Note that since the algorithm is 
stochastic, simply re-running the demo may fix the problem. But in the real world, this is not 
an option. The simplest way to improve performance is to use more particles. An alternative 
is to perform tracking by detection, by running an object detector over the image every few 
frames. See (Forsyth and Ponce 2002; Szeliski 2010; Prince 2012) for details. 
Application: time series forecasting


In Section 18.2.4, we discussed how to use the Kalman filter to perform time series forecasting. 
This assumes that the model is a linear-Gaussian state-space model. However, many models
are non-linear and/or non-Gaussian. For example, stochastic volatility models,
which are widely used in finance, assume that the variance of the system and/or observation 
noise changes over time. Particle filtering is widely used in such settings. See e.g., (Doucet et al. 
2001) and references therein for details. 


Rao-Blackwellised particle filtering (RBPF) 


In some models, we can partition the hidden variables into two kinds, $q_t$ and $z_t$, such that
we can analytically integrate out $z_t$ provided we know the values of $q_{1:t}$. This means we only
have to sample $q_{1:t}$, and can represent $p(z_t|q_{1:t})$ parametrically. Thus each particle $s$ represents
a value for $q_{1:t}^s$ and a distribution of the form $p(z_t|y_{1:t}, q_{1:t}^s)$. These hybrid particles are
sometimes called distributional particles or collapsed particles (Koller and Friedman 2009,
Sec 12.4).

The advantage of this approach is that we reduce the dimensionality of the space in which 
we are sampling, which reduces the variance of our estimate. Hence this technique is known 
as Rao-Blackwellised particle filtering or RBPF for short, named after Theorem 24.20. The 
method is best explained using a specific example. 


RBPF for switching LG-SSMs 


A canonical example for which RBPF can be applied is the switching linear dynamical system
(SLDS) model discussed in Section 18.6 (Chen and Liu 2000; Doucet et al. 2001). We can represent
$p(z_t|y_{1:t}, q_{1:t}^s)$ using a mean and covariance matrix for each particle $s$, where $q_t \in \{1, \ldots, K\}$.
If we propose from the prior, $q(q_t = k|q_{t-1}^s)$, the weight update becomes

$$w_t^s \propto w_{t-1}^s \, p(y_t|q_t = k, q_{1:t-1}^s, y_{1:t-1}) = w_{t-1}^s L_{t,k}^s \tag{23.56}$$

where

$$L_{t,k}^s = \int p(y_t|q_t = k, z_t, y_{1:t-1}, q_{1:t-1}^s)\, p(z_t|q_t = k, y_{1:t-1}, q_{1:t-1}^s)\, dz_t \tag{23.57}$$

The quantity $L_{t,k}^s$ is the predictive density for the new observation $y_t$ conditioned on $q_t = k$ and
the history $q_{1:t-1}^s$. In the case of SLDS models, this can be computed using the normalization
constant of the Kalman filter, Equation 18.41.

We give some pseudo-code in Algorithm 23.2. (The step marked “KFupdate” refers to the Kalman
filter update equations in Section 18.3.1.) This is known as a mixture of Kalman filters.
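A scalar version of this scheme can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the code behind the figures: the model is a hypothetical 1d SLDS $z_t = a_k z_{t-1} + b_k + \mathcal{N}(0, q_k)$, $y_t = z_t + \mathcal{N}(0, r_k)$ with two regimes, and `kf_update` is our scalar stand-in for the KFupdate step, returning the posterior mean, the posterior variance, and the predictive likelihood $L_{t,k}^s$.

```python
import math
import random


def kf_update(mu, var, y, a, b, q, r):
    """Scalar Kalman predict + update; also returns the predictive density of y."""
    mu_pred = a * mu + b
    var_pred = a * a * var + q
    s = var_pred + r  # innovation variance
    lik = math.exp(-0.5 * (y - mu_pred) ** 2 / s) / math.sqrt(2.0 * math.pi * s)
    gain = var_pred / s
    return mu_pred + gain * (y - mu_pred), (1.0 - gain) * var_pred, lik


def rbpf_step(particles, weights, y, trans, params, s_min, rng):
    """One RBPF step with the prior as proposal.

    particles: list of (q, mu, var); trans[i][k] = p(q_t = k | q_{t-1} = i).
    """
    K = len(params)
    new_particles, new_weights = [], []
    for (q_prev, mu, var), w in zip(particles, weights):
        k = rng.choices(range(K), weights=trans[q_prev])[0]  # sample discrete state from prior
        mu_new, var_new, lik = kf_update(mu, var, y, *params[k])
        new_particles.append((k, mu_new, var_new))
        new_weights.append(w * lik)  # weight update uses the predictive likelihood
    total = sum(new_weights)
    new_weights = [w / total for w in new_weights]
    if 1.0 / sum(w * w for w in new_weights) < s_min:
        n = len(new_particles)
        idx = rng.choices(range(n), weights=new_weights, k=n)
        new_particles = [new_particles[i] for i in idx]
        new_weights = [1.0 / n] * n
    return new_particles, new_weights


rng = random.Random(0)
params = [(1.0, 0.0, 0.1, 0.5), (1.0, 1.0, 0.1, 0.5)]  # assumed (a, b, q, r) per regime
trans = [[0.95, 0.05], [0.05, 0.95]]
S = 200
particles = [(0, 0.0, 1.0)] * S
weights = [1.0 / S] * S
z, abs_errors = 0.0, []
for t in range(40):
    a, b, q, r = params[0 if t < 20 else 1]  # the true regime switches at t = 20
    z = a * z + b + rng.gauss(0.0, math.sqrt(q))
    y = z + rng.gauss(0.0, math.sqrt(r))
    particles, weights = rbpf_step(particles, weights, y, trans, params, S / 2, rng)
    z_hat = sum(w * mu for w, (_, mu, _) in zip(weights, particles))
    abs_errors.append(abs(z_hat - z))
mean_abs_error = sum(abs_errors) / len(abs_errors)
```

Each particle stores a discrete state together with a Gaussian (mean and variance) over the continuous state, so the posterior over $z_t$ is a weighted mixture of Gaussians rather than a set of point samples.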

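If K is small, we can compute the optimal proposal distribution, which is

$$p(q_t = k|y_{1:t}, q_{1:t-1}^s) = \hat{p}_{t-1}(q_t = k|y_t) \tag{23.58}$$

$$= \frac{\hat{p}_{t-1}(y_t|q_t = k)\, \hat{p}_{t-1}(q_t = k)}{\hat{p}_{t-1}(y_t)} \tag{23.59}$$

$$= \frac{L_{t,k}^s \, p(q_t = k|q_{t-1}^s)}{\sum_{k'} L_{t,k'}^s \, p(q_t = k'|q_{t-1}^s)} \tag{23.60}$$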
Algorithm 23.2: One step of RBPF for SLDS using prior as proposal
for s = 1 : S do
    $k \sim p(q_t|q_{t-1}^s)$;
    $q_t^s := k$;
    $(\mu_t^s, \Sigma_t^s, L_{t,k}^s) = \text{KFupdate}(\mu_{t-1}^s, \Sigma_{t-1}^s, y_t, \theta_k)$;
    $w_t^s = w_{t-1}^s L_{t,k}^s$;
Normalize weights: $w_t^s = \frac{w_t^s}{\sum_{s'} w_t^{s'}}$;
Compute $\hat{S}_{\text{eff}} = \frac{1}{\sum_{s=1}^{S} (w_t^s)^2}$;
if $\hat{S}_{\text{eff}} < S_{\min}$ then
    Resample S indices $\pi \sim w_t$;
    $q_t^{:} = q_t^{\pi}$, $\mu_t^{:} = \mu_t^{\pi}$, $\Sigma_t^{:} = \Sigma_t^{\pi}$;
    $w_t^s = 1/S$;


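where we use the following shorthand:

$$\hat{p}_{t-1}(\cdot) = p(\cdot|y_{1:t-1}, q_{1:t-1}^s) \tag{23.61}$$

We then sample from $p(q_t|q_{1:t-1}^s, y_{1:t})$ and give the resulting particle weight

$$w_t^s \propto w_{t-1}^s \, p(y_t|q_{1:t-1}^s, y_{1:t-1}) = w_{t-1}^s \sum_k \left[ L_{t,k}^s \, p(q_t = k|q_{t-1}^s) \right] \tag{23.62}$$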
Since the weights of the particles in Equation 23.62 are independent of the new value that is
actually sampled for $q_t$, we can compute these weights first, and use them to decide which
particles to propagate. That is, we choose the fittest particles at time $t - 1$ using information
from time $t$. This is called look-ahead RBPF (de Freitas et al. 2004).

In more detail, the idea is this. We pass each sample in the prior through all K models
to get K posteriors per sample, one for each model. The normalization constants of this process allow us to
compute the optimal weights in Equation 23.62. We then resample S indices. Finally, for each
old particle $s$ that is chosen, we sample one new state $q_t^s = k$, and use the corresponding
posterior from the K possible alternatives that we have already computed. The pseudo-code is
shown in Algorithm 23.3. This method needs $O(KS)$ storage, but has the advantage that each
particle is chosen using the latest information, $y_t$.
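The weight and proposal computations at the heart of the look-ahead scheme can be sketched as follows. The function name and the numerical inputs are hypothetical; in a real filter the table `L` of predictive likelihoods would come from running the Kalman update for every particle under every discrete state.

```python
def lookahead_weights(prev_weights, prev_states, L, trans):
    """Look-ahead weights and per-particle optimal proposals over the discrete state.

    L[s][k] is the predictive likelihood of y_t if particle s moves to state k;
    trans[i][k] = p(q_t = k | q_{t-1} = i).
    """
    weights, proposals = [], []
    for w, q_prev, L_s in zip(prev_weights, prev_states, L):
        joint = [L_s[k] * trans[q_prev][k] for k in range(len(L_s))]
        evidence = sum(joint)  # sum_k L_{t,k}^s p(q_t = k | q_{t-1}^s)
        weights.append(w * evidence)
        proposals.append([j / evidence for j in joint])  # optimal proposal for particle s
    total = sum(weights)
    return [w / total for w in weights], proposals


trans = [[0.9, 0.1], [0.1, 0.9]]
weights, proposals = lookahead_weights(
    prev_weights=[0.5, 0.5],
    prev_states=[0, 1],
    L=[[0.2, 0.4], [0.2, 0.4]],
    trans=trans,
)
```

In this example the observation favors state 1 under both particles, so the particle that was already in state 1 receives the larger weight (0.19/0.30) before any new discrete state has been sampled. Resampling would then use `weights`, and each surviving particle would draw its new state from its row of `proposals`.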

A further improvement can be obtained by exploiting the fact that the state space is discrete. 
Hence we can use the resampling method of (Fearnhead 2004) which avoids duplicating particles. 


Application: tracking a maneuvering target 


One application of SLDS is to track moving objects that have piecewise linear dynamics. For 
example, suppose we want to track an airplane or missile; $q_t$ can specify if the object is flying
normally or is taking evasive action. This is called maneuvering target tracking. 

Figure 23.7 gives an example of an object moving in 2d. The setup is essentially the same 
as in Section 18.2.1, except that we add a three-state discrete Markov chain which controls the 
Algorithm 23.3: One step of look-ahead RBPF for SLDS using optimal proposal
for s = 1 : S do
    for k = 1 : K do
        $(\mu_{t,k}^s, \Sigma_{t,k}^s, L_{t,k}^s) = \text{KFupdate}(\mu_{t-1}^s, \Sigma_{t-1}^s, y_t, \theta_k)$;
    $w_t^s = w_{t-1}^s \left[ \sum_k L_{t,k}^s \, p(q_t = k|q_{t-1}^s) \right]$;
Normalize weights: $w_t^s = \frac{w_t^s}{\sum_{s'} w_t^{s'}}$;
Resample S indices $\pi \sim w_t$;
for $s \in \pi$ do
    Compute optimal proposal $p(k|q_{1:t-1}^s, y_{1:t}) = \frac{L_{t,k}^s \, p(q_t = k|q_{t-1}^s)}{\sum_{k'} L_{t,k'}^s \, p(q_t = k'|q_{t-1}^s)}$;
    Sample $k \sim p(k|q_{1:t-1}^s, y_{1:t})$;
    $q_t^s = k$, $\mu_t^s = \mu_{t,k}^s$, $\Sigma_t^s = \Sigma_{t,k}^s$;
    $w_t^s = 1/S$;


Method   Misclassification rate   MSE      Time (seconds)
PF       0.440                    21.051   6.086
RBPF     0.340                    18.168   10.986


Table 23.1 Comparison of PF and RBPF on the maneuvering target problem in Figure 23.7.


input to the system. We define $u_t = 1$ and set
$B_1 = (0, 0, 0, 0)^T$, $B_2 = (-1.225, -0.35, 1.225, 0.35)^T$, $B_3 = (1.225, 0.35, -1.225, -0.35)^T$


so the system will turn in different directions depending on the discrete state. 

Figure 23.7(a) shows the true state of the system from a sample run, starting at (0,0): the 
colored symbols denote the discrete state, and the location of the symbol denotes the (x, y) 
location. The small dots represent noisy observations. Figure 23.7(b) shows the estimate of 
the state computed using particle filtering with 500 particles, where the proposal is to sample 
from the prior. The colored symbols denote the MAP estimate of the state, and the location of 
the symbol denotes the MMSE (minimum mean square error) estimate of the location, which is 
given by the posterior mean. Figure 23.7(c) shows the estimate computed using RBPF with 500
particles, using the optimal proposal distribution. A more quantitative comparison is shown in 
Table 23.1. We see that RBPF has slightly better performance, although it is also slightly slower. 

Figure 23.8 visualizes the belief state of the system. In (a) we show the distribution over the 
discrete states. We see that the particle filter estimate of the belief state (second column) is not 
as accurate as the RBPF estimate (third column) in the beginning, although after the first few 
observations performance is similar for both methods. In (b), we plot the posterior over the x 
locations. For simplicity, we use the PF estimate, which is a set of weighted samples, but we 
could also have used the RBPF estimate, which is a set of weighted Gaussians. 
Figure 23.7 (a) A maneuvering target. The colored symbols represent the hidden discrete state. (b)
Particle filter estimate. (c) RBPF estimate. Figure generated by rbpfManeuverDemo, based on code by 
Nando de Freitas. 


Application: Fast SLAM 


In Section 18.2.2, we introduced the problem of simultaneous localization and mapping or SLAM
for mobile robotics. The main problem with the Kalman filter implementation is that it is cubic
in the number of landmarks. However, by looking at the DGM in Figure 18.2, we see that,
conditional on knowing the robot's path, $q_{1:t}$, where $q_t \in \mathbb{R}^2$, the landmark locations $z \in \mathbb{R}^{2L}$
are independent. (We assume the landmarks don’t move, so we drop the $t$ subscript.) That is,
$p(z|q_{1:t}, y_{1:t}) = \prod_{l=1}^{L} p(z_l|q_{1:t}, y_{1:t})$. Consequently we can use RBPF, where we sample the
robot's trajectory, $q_{1:t}$, and we run $L$ independent 2d Kalman filters inside each particle. This
takes $O(L)$ time per particle. Fortunately, the number of particles needed for good performance
is quite small (this partly depends on the control / exploration policy), so the algorithm is
essentially linear in the number of landmarks. This technique has the additional advantage that
[Figure 23.8 panel titles: pf, error rate 0.440; rbpf, error rate 0.340.]

Figure 23.8 Belief states corresponding to Figure 23.7. (a) Discrete state. The system starts in state
2 (red x in Figure 23.7), then moves to state 3 (black * in Figure 23.7), returns briefly to state 2, then
switches to state 1 (blue circle in Figure 23.7), etc. (b) Horizontal location (PF estimate). Figure generated
by rbpfManeuverDemo, based on code by Nando de Freitas.


it is easy to use sampling to handle the data association ambiguity, and that it allows for other 
representations of the map, such as occupancy grids. This idea was first suggested in (Murphy 
2000), and was subsequently extended and made practical in (Thrun et al. 2004), who christened 
the technique FastSLAM. See rbpfSlamDemo for a simple demo in a discrete grid world. 


Exercises 


Exercise 23.1 Sampling from a Cauchy 


Show how to use inverse probability transform to sample from a standard Cauchy, $\mathcal{T}(x|0, 1, 1)$.


Exercise 23.2 Rejection sampling from a Gamma using a Cauchy proposal 

Show how to use a Cauchy proposal to perform rejection sampling from a Gamma distribution. Derive the 
optimal constant M, and plot the density and its upper envelope. 

Exercise 23.3 Optimal proposal for particle filtering with linear-Gaussian measurement model 


Consider a state-space model of the following form:

$$z_t = f_t(z_{t-1}) + \mathcal{N}(0, Q_{t-1}) \tag{23.63}$$

$$y_t = H_t z_t + \mathcal{N}(0, R_t) \tag{23.64}$$

Derive expressions for $p(z_t|z_{t-1}, y_t)$ and $p(y_t|z_{t-1})$, which are needed to compute the optimal (minimum
variance) proposal distribution. Hint: use Bayes rule for Gaussians.


Markov chain Monte Carlo (MCMC) inference


Introduction 


In Chapter 23, we introduced some simple Monte Carlo methods, including rejection sampling 
and importance sampling. The trouble with these methods is that they do not work well in high 
dimensional spaces. The most popular method for sampling from high-dimensional distributions 
is Markov chain Monte Carlo or MCMC. In a survey by SIAM News[1], MCMC was placed in the
top 10 most important algorithms of the 20th century. 

The basic idea behind MCMC is to construct a Markov chain (Section 17.2) on the state space 
X whose stationary distribution is the target density p*(x) of interest (this may be a prior or a 
posterior). That is, we perform a random walk on the state space, in such a way that the fraction 
of time we spend in each state x is proportional to p*(x). By drawing (correlated!) samples 
$x_0, x_1, x_2, \ldots$ from the chain, we can perform Monte Carlo integration wrt $p^*$. We give the
details below. 

The MCMC algorithm has an interesting history. It was discovered by physicists working 
on the atomic bomb at Los Alamos during World War II, and was first published in the open 
literature in (Metropolis et al. 1953) in a chemistry journal. An extension was published in 
the statistics literature in (Hastings 1970), but was largely unnoticed. A special case (Gibbs 
sampling, Section 24.2) was independently invented in 1984 in the context of Ising models and 
was published in (Geman and Geman 1984). But it was not until (Gelfand and Smith 1990) that 
the algorithm became well-known to the wider statistical community. Since then it has become 
wildly popular in Bayesian statistics, and is becoming increasingly popular in machine learning. 

It is worth briefly comparing MCMC to variational inference (Chapter 21). The advantages
of variational inference are (1) for small to medium problems, it is usually faster; (2) it is
deterministic; (3) it is easy to determine when to stop; (4) it often provides a lower bound on
the log likelihood. The advantages of sampling are: (1) it is often easier to implement; (2) it
is applicable to a broader range of models, such as models whose size or structure changes
depending on the values of certain variables (e.g., as happens in matching problems), or models
without nice conjugate priors; (3) sampling can be faster than variational methods when applied
to really huge models or datasets.[2]


1. Source: http://www.siam.org/pdf/news/637.pdf.

2. The reason is that sampling passes specific values of variables (or sets of variables), whereas in variational inference,
we pass around distributions. Thus sampling passes sparse messages, whereas variational inference passes dense
messages. For comparisons of the two approaches, see e.g., (Yoshida and West 2010) and articles in (Bekkerman et al.
Gibbs sampling


In this section, we present one of the most popular MCMC algorithms, known as Gibbs sampling.[3]
(In physics, this method is known as Glauber dynamics or the heat bath method.) This
is the MCMC analog of coordinate descent.


Basic idea 


The idea behind Gibbs sampling is that we sample each variable in turn, conditioned on the
values of all the other variables in the distribution. That is, given a joint sample $x^s$ of all the
variables, we generate a new sample $x^{s+1}$ by sampling each component in turn, based on the
most recent values of the other variables. For example, if we have D = 3 variables, we use

• $x_1^{s+1} \sim p(x_1|x_2^s, x_3^s)$

• $x_2^{s+1} \sim p(x_2|x_1^{s+1}, x_3^s)$

• $x_3^{s+1} \sim p(x_3|x_1^{s+1}, x_2^{s+1})$

This readily generalizes to D variables. If $x_i$ is a visible variable, we do not sample it, since its
value is already known.

The expression $p(x_i|x_{-i})$ is called the full conditional for variable $i$. In general, $x_i$ may only
depend on some of the other variables. If we represent $p(x)$ as a graphical model, we can infer
the dependencies by looking at $i$'s Markov blanket, which are its neighbors in the graph. Thus
to sample $x_i$, we only need to know the values of $i$'s neighbors. In this sense, Gibbs sampling
is a distributed algorithm. However, it is not a parallel algorithm, since the samples must be
generated sequentially.

For reasons that we will explain in Section 24.4.1, it is necessary to discard some of the 
initial samples until the Markov chain has burned in, or entered its stationary distribution. We 
discuss how to estimate when burnin has occurred in Section 24.4.1. In the examples below, we
just discard the initial 25% of the samples, for simplicity. 
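As a minimal illustration of the update scheme, the following Python sketch runs Gibbs sampling on a target that is not taken from this chapter: a zero-mean, unit-variance bivariate Gaussian with correlation ρ, for which each full conditional is the standard result $x_1|x_2 \sim \mathcal{N}(\rho x_2, 1 - \rho^2)$ (and symmetrically for $x_2$). The function name, ρ = 0.8, and the chain length are assumptions chosen for the example.

```python
import math
import random


def gibbs_bivariate_gaussian(rho, num_samples, rng):
    """Gibbs sampler for a standard bivariate Gaussian with correlation rho."""
    x1, x2 = 0.0, 0.0
    cond_std = math.sqrt(1.0 - rho * rho)
    samples = []
    for _ in range(num_samples):
        x1 = rng.gauss(rho * x2, cond_std)  # x1 ~ p(x1 | x2)
        x2 = rng.gauss(rho * x1, cond_std)  # x2 ~ p(x2 | x1), using the new x1
        samples.append((x1, x2))
    return samples


rng = random.Random(0)
samples = gibbs_bivariate_gaussian(0.8, 20000, rng)
kept = samples[len(samples) // 4:]  # discard the first 25% as burn-in
corr_estimate = sum(a * b for a, b in kept) / len(kept)
```

Since both variables have zero mean and unit variance, the average of $x_1 x_2$ over the retained samples estimates ρ. Successive samples are correlated, so the estimate is noisier than one based on the same number of independent draws.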


Example: Gibbs sampling for the Ising model 


In Section 21.3.2, we applied mean field to an Ising model. Here we apply Gibbs sampling.
Gibbs sampling in pairwise MRF/CRF takes the form

$$p(x_t|x_{-t}, \theta) \propto \prod_{s \in \text{nbr}(t)} \psi_{st}(x_s, x_t) \tag{24.1}$$

In the case of an Ising model with edge potentials $\psi(x_s, x_t) = \exp(J x_s x_t)$, where $x_t \in \{-1, +1\}$,
the full conditional becomes

$$p(x_t = +1|x_{-t}, \theta) = \frac{\prod_{s \in \text{nbr}(t)} \psi_{st}(x_t = +1, x_s)}{\prod_{s \in \text{nbr}(t)} \psi_{st}(x_t = +1, x_s) + \prod_{s \in \text{nbr}(t)} \psi_{st}(x_t = -1, x_s)} \tag{24.2}$$

$$= \frac{\exp[J \sum_{s \in \text{nbr}(t)} x_s]}{\exp[J \sum_{s \in \text{nbr}(t)} x_s] + \exp[-J \sum_{s \in \text{nbr}(t)} x_s]} \tag{24.3}$$

$$= \frac{\exp[J \eta_t]}{\exp[J \eta_t] + \exp[-J \eta_t]} = \text{sigm}(2 J \eta_t) \tag{24.4}$$

where $J$ is the coupling strength, $\eta_t \triangleq \sum_{s \in \text{nbr}(t)} x_s$ and $\text{sigm}(u) = 1/(1 + e^{-u})$ is the sigmoid
function. It is easy to see that $\eta_t = x_t(a_t - d_t)$, where $a_t$ is the number of neighbors that agree
with (have the same sign as) $t$, and $d_t$ is the number of neighbors who disagree. If this number
is equal, the “forces” on $x_t$ cancel out, so the full conditional is uniform.

We can combine an Ising prior with a local evidence term $\psi_t$. For example, with a Gaussian
observation model, we have $\psi_t(x_t) = \mathcal{N}(y_t|x_t, \sigma^2)$. The full conditional becomes

$$p(x_t = +1|x_{-t}, y, \theta) = \frac{\exp[J \eta_t]\, \psi_t(+1)}{\exp[J \eta_t]\, \psi_t(+1) + \exp[-J \eta_t]\, \psi_t(-1)} \tag{24.5}$$

$$= \text{sigm}\left( 2 J \eta_t - \log \frac{\psi_t(-1)}{\psi_t(+1)} \right) \tag{24.6}$$

Now the probability of $x_t$ entering each state is determined both by compatibility with its
neighbors (the Ising prior) and compatibility with the data (the local likelihood term).

See Figure 24.1 for an example of this algorithm applied to a simple image denoising problem.
The results are similar to mean field (Figure 21.3) except that the final estimate (based on
averaging the samples) is somewhat “blurrier”, due to the fact that mean field tends to be
over-confident.

[Figure 24.1 panel titles: sample 5, Gibbs; mean after 15 sweeps of Gibbs.]

Figure 24.1 Example of image denoising. We use an Ising prior with $W_{ij} = J = 1$ and a Gaussian
noise model with $\sigma = 2$. We use Gibbs sampling (Section 24.2) to perform approximate inference. (a)
Sample from the posterior after one sweep over the image. (b) Sample after 5 sweeps. (c) Posterior mean,
computed by averaging over 15 sweeps. Compare to Figure 21.3 which shows the results of using mean
field inference. Figure generated by isingImageDenoiseDemo.

3. Josiah Willard Gibbs, 1839-1903, was an American physicist.
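The denoising procedure can be sketched in a few lines of Python. This is an illustrative reimplementation, not the isingImageDenoiseDemo code: the 16 by 16 two-tone test image, the noise level σ = 1, the initialization from the thresholded data, and the 5 sweeps of burn-in are all assumptions. For the Gaussian evidence term, $\log \psi_t(+1) - \log \psi_t(-1) = 2 y_t / \sigma^2$, which is added to $2 J \eta_t$ inside the sigmoid.

```python
import math
import random


def sigm(u):
    return 1.0 / (1.0 + math.exp(-u))


def gibbs_sweep(x, y, J, sigma, rng):
    """One in-place Gibbs sweep over an Ising grid x (entries +1/-1) given noisy pixels y."""
    rows, cols = len(x), len(x[0])
    for r in range(rows):
        for c in range(cols):
            # eta: sum of the current values of the (up to four) grid neighbors.
            eta = sum(x[rr][cc]
                      for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                      if 0 <= rr < rows and 0 <= cc < cols)
            # log psi(+1) - log psi(-1) for psi(x_t) = N(y_t | x_t, sigma^2).
            log_odds_evidence = 2.0 * y[r][c] / sigma ** 2
            p_plus = sigm(2.0 * J * eta + log_odds_evidence)
            x[r][c] = 1 if rng.random() < p_plus else -1


rng = random.Random(0)
N, J, SIGMA = 16, 1.0, 1.0
truth = [[-1 if c < N // 2 else 1 for c in range(N)] for r in range(N)]
y = [[truth[r][c] + rng.gauss(0.0, SIGMA) for c in range(N)] for r in range(N)]
x = [[1 if y[r][c] > 0 else -1 for c in range(N)] for r in range(N)]  # start from thresholded data
mean = [[0.0] * N for _ in range(N)]
num_sweeps, burn_in = 15, 5
for sweep in range(num_sweeps):
    gibbs_sweep(x, y, J, SIGMA, rng)
    if sweep >= burn_in:
        for r in range(N):
            for c in range(N):
                mean[r][c] += x[r][c] / (num_sweeps - burn_in)
accuracy = sum((mean[r][c] > 0) == (truth[r][c] > 0)
               for r in range(N) for c in range(N)) / (N * N)
```

Thresholding the posterior mean at zero gives a denoised image; `accuracy` measures the fraction of pixels recovered correctly. Setting the evidence term to zero recovers sampling from the Ising prior alone, as in Equation 24.4.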
Example: Gibbs sampling for inferring the parameters of a GMM


It is straightforward to derive a Gibbs sampling algorithm to “fit” a mixture model, especially 
if we use conjugate priors. We will focus on the case of mixture of Gaussians, although the 
results are easily extended to other kinds of mixture models. (The derivation, which follows from 
the results of Section 4.6, is much easier than the corresponding variational Bayes algorithm in 
Section 21.6.1.) 

Suppose we use a semi-conjugate prior. Then the full joint distribution is given by 


K 
P(X,2, p, Em) = p(x|z, w,E)p(z|r)p(or) | | pepr) (24.7) 
k=1 
N K 
z (TEI corem zore) x (24.8) 
i=l k=1 
: 
Dir(mla) | | M(tx|m0, Vo)IW(Zx|So, vo) (24.9) 
k=1 


We use the same prior for each mixture component. The full conditionals are as follows. For 
the discrete indicators, we have 


P(zi = k|x;, y, 4,7) x TEN (Xi| My, De) (24.10) 
For the mixing weights, we have (using results from Section 3.4) 
N 
p(m|z) = Dir({ax +X Iz =k) Hy) (24.11) 
i=1 


For the means, we have (using results from Section 4.6.1) 


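$$p(\boldsymbol{\mu}_k|\boldsymbol{\Sigma}_k, \mathbf{z}, \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_k|\mathbf{m}_k, \mathbf{V}_k) \tag{24.12}$$

$$\mathbf{V}_k^{-1} = \mathbf{V}_0^{-1} + N_k \boldsymbol{\Sigma}_k^{-1} \tag{24.13}$$

$$\mathbf{m}_k = \mathbf{V}_k\left(\boldsymbol{\Sigma}_k^{-1} N_k \bar{\mathbf{x}}_k + \mathbf{V}_0^{-1} \mathbf{m}_0\right) \tag{24.14}$$

$$N_k \triangleq \sum_{i=1}^N \mathbb{I}(z_i = k) \tag{24.15}$$

$$\bar{\mathbf{x}}_k \triangleq \frac{\sum_{i=1}^N \mathbb{I}(z_i = k)\, \mathbf{x}_i}{N_k} \tag{24.16}$$

For the covariances, we have (using results from Section 4.6.2)

$$p(\boldsymbol{\Sigma}_k|\boldsymbol{\mu}_k, \mathbf{z}, \mathbf{x}) = \text{IW}(\boldsymbol{\Sigma}_k|\mathbf{S}_k, \nu_k) \tag{24.17}$$

$$\mathbf{S}_k = \mathbf{S}_0 + \sum_{i=1}^N \mathbb{I}(z_i = k)(\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^T \tag{24.18}$$

$$\nu_k = \nu_0 + N_k \tag{24.19}$$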
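
The full conditionals above translate fairly directly into code. The following Python sketch is a simplified illustration, not the gaussMissingFitGibbs implementation: it assumes one-dimensional data and a known, shared variance `sigma2`, so it only samples the scalar analogues of Equations 24.10 to 24.14 and omits the covariance update of Equations 24.17 to 24.19. The function name, the default hyper-parameters and the quantile-based initialization are all illustrative choices.

```python
import math
import random


def gibbs_gmm_1d(x, K, n_iter, sigma2=1.0, m0=0.0, v0=100.0, alpha=1.0, seed=0):
    """Gibbs sampler for a 1D Gaussian mixture with known variance sigma2.

    Priors (assumed for this sketch): pi ~ Dir(alpha, ..., alpha) and
    mu_k ~ N(m0, v0). Returns a list of (pi, mu, z) samples.
    """
    rng = random.Random(seed)
    N = len(x)
    xs = sorted(x)
    # Illustrative initialization: spread the means over the data quantiles.
    mu = [xs[(2 * k + 1) * N // (2 * K)] for k in range(K)]
    pi = [1.0 / K] * K
    z = [0] * N
    samples = []
    for _ in range(n_iter):
        # Indicators: p(z_i = k | ...) is proportional to pi_k N(x_i | mu_k, sigma2).
        for i in range(N):
            logw = [math.log(pi[k]) - 0.5 * (x[i] - mu[k]) ** 2 / sigma2
                    for k in range(K)]
            top = max(logw)  # subtract the max to avoid underflow
            w = [math.exp(lw - top) for lw in logw]
            z[i] = rng.choices(range(K), weights=w)[0]
        counts = [z.count(k) for k in range(K)]
        sums = [sum(xi for xi, zi in zip(x, z) if zi == k) for k in range(K)]
        # Mixing weights: a Dirichlet draw via normalized Gamma variates.
        g = [rng.gammavariate(alpha + counts[k], 1.0) for k in range(K)]
        total = sum(g)
        pi = [gk / total for gk in g]
        # Means: scalar versions of the Gaussian posterior update.
        for k in range(K):
            prec = 1.0 / v0 + counts[k] / sigma2
            mean = (m0 / v0 + sums[k] / sigma2) / prec
            mu[k] = rng.gauss(mean, math.sqrt(1.0 / prec))
        samples.append((list(pi), list(mu), list(z)))
    return samples
```

Calling `gibbs_gmm_1d(data, K=2, n_iter=50)` on data drawn from two well separated clusters should give later samples whose means sit near the two cluster centers, although which index refers to which cluster is arbitrary.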


See gaussMissingFitGibbs for some Matlab code. (This code can also sample missing
values for x, if necessary.)


Label switching


Although it is simple to implement, Gibbs sampling for mixture models has a fundamental
weakness. The problem is that the parameters of the model $\boldsymbol{\theta}$, and the indicator functions $\mathbf{z}$, are
unidentifiable, since we can arbitrarily permute the hidden labels without affecting the likelihood
(see Section 11.3.1). Consequently, we cannot just take a Monte Carlo average of the samples to
compute posterior means, since what one sample considers the parameters for cluster 1 may be
what another sample considers the parameters for cluster 2. Indeed, if we could average over
all modes, we would find $\mathbb{E}[\boldsymbol{\mu}_k|\mathcal{D}]$ is the same for all $k$ (assuming a symmetric prior). This is
called the label switching problem.

This problem does not arise in EM or VBEM, which just “lock on” to a single mode. However,
it arises in any method that visits multiple modes. In 1d problems, one can try to prevent this
problem by introducing constraints on the parameters to ensure identifiability, e.g., $\mu_1 < \mu_2 <
\mu_3$ (Richardson and Green 1997). However, this does not always work, since the likelihood might
overwhelm the prior and cause label switching anyway. Furthermore, this technique does not
scale to higher dimensions. Another approach is to post-process the samples by searching for a
global label permutation to apply to each sample that minimizes some loss function (Stephens
2000); however, this can be slow.

Perhaps the best solution is simply to “not ask” questions that cannot be uniquely identified. 
For example, instead of asking for the probability that data point i belongs to cluster k, ask 
for the probability that data points i and j belong to the same cluster. The latter question is 
invariant to the labeling. Furthermore, it only refers to observable quantities (are i and j grouped 
together or not), rather than referring to unobservable quantities, such as latent clusters. This 
approach has the further advantage that it extends to infinite mixture models, discussed in 
Section 25.2, where K is unbounded; in such models, the notion of a hidden cluster is not well 
defined, but the notion of a partitioning of the data is well defined.
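
The label-invariant query suggested above is easy to estimate from sampler output. The following Python sketch assumes `z_samples` is a list of label vectors, one per retained Gibbs sample; the function name is illustrative. It estimates the probability that points $i$ and $j$ share a cluster by counting how often their labels agree.

```python
from itertools import combinations


def coclustering_probs(z_samples):
    """Estimate p(z_i = z_j | D) for all pairs i < j from sampled label vectors."""
    S = len(z_samples)
    N = len(z_samples[0])
    probs = {}
    for i, j in combinations(range(N), 2):
        # Only agreement between labels matters, not the label values themselves.
        same = sum(1 for z in z_samples if z[i] == z[j])
        probs[(i, j)] = same / S
    return probs
```

Because only equality of labels is used, relabeling the clusters within any individual sample leaves the result unchanged.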


Collapsed Gibbs sampling * 


In some cases, we can analytically integrate out some of the unknown quantities, and just 
sample the rest. This is called a collapsed Gibbs sampler, and it tends to be much more 
efficient, since it is sampling in a lower dimensional space. 

More precisely, suppose we sample $\mathbf{z}$ and integrate out $\boldsymbol{\theta}$. Thus the $\boldsymbol{\theta}$ parameters do not
participate in the Markov chain; consequently we can draw conditionally independent samples
$\boldsymbol{\theta}^s \sim p(\boldsymbol{\theta}|\mathbf{z}^s, \mathcal{D})$, which will have much lower variance than samples drawn from the joint state
space (Liu et al. 1994). This process is called Rao-Blackwellisation, named after the following
theorem:


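Theorem 24.2.1 (Rao-Blackwell). Let $\mathbf{z}$ and $\boldsymbol{\theta}$ be dependent random variables, and $f(\mathbf{z}, \boldsymbol{\theta})$ be
some scalar function. Then


$$\text{var}_{\mathbf{z},\boldsymbol{\theta}}[f(\mathbf{z}, \boldsymbol{\theta})] \geq \text{var}_{\mathbf{z}}\left[\mathbb{E}_{\boldsymbol{\theta}}[f(\mathbf{z}, \boldsymbol{\theta})|\mathbf{z}]\right] \tag{24.20}$$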


This theorem guarantees that the variance of the estimate created by analytically integrating
out $\boldsymbol{\theta}$ will always be lower (or rather, will never be higher) than the variance of a direct MC
estimate. In collapsed Gibbs, we sample $\mathbf{z}$ with $\boldsymbol{\theta}$ integrated out; the above Rao-Blackwell
theorem still applies in this case (Liu et al. 1994).


Figure 24.2 (a) A mixture model. (b) After integrating out the parameters.
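
The inequality in Theorem 24.2.1 can be checked numerically on a toy model. The Python sketch below assumes a hypothetical model, chosen only because its conditional expectation is known in closed form: $z \sim \mathcal{N}(0, 1)$, $\theta|z \sim \mathcal{N}(z, 1)$ and $f(z, \theta) = \theta$, so that $\mathbb{E}[f|z] = z$. The direct samples then have variance 2, whereas the Rao-Blackwellised values have variance 1.

```python
import random
import statistics


def compare_estimators(n_samples=20000, seed=0):
    """Return (variance of f(z, theta), variance of E[f | z]) for the toy model."""
    rng = random.Random(seed)
    direct, rao_blackwell = [], []
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)
        theta = rng.gauss(z, 1.0)
        direct.append(theta)       # f(z, theta) = theta
        rao_blackwell.append(z)    # E[theta | z] = z, computed analytically
    return statistics.variance(direct), statistics.variance(rao_blackwell)
```

Both lists have the same mean in expectation, so averaging the second gives an estimate of $\mathbb{E}[f]$ with roughly half the variance in this example.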


We will encounter Rao-Blackwellisation again in Section 23.6. Although it can reduce statistical 
variance, it is only worth doing if the integrating out can be done quickly, otherwise we will not 
be able to produce as many samples per second as the naive method. We give an example of 
this below. 


Example: collapsed Gibbs for fitting a GMM 


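Consider a GMM with a fully conjugate prior. In this case we can analytically integrate out the
model parameters $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$ and $\boldsymbol{\pi}$, and just sample the indicators $\mathbf{z}$. Once we integrate out $\boldsymbol{\pi}$,
all the $z_i$ nodes become inter-dependent. Similarly, once we integrate out $\boldsymbol{\theta}_k$, all the $\mathbf{x}_i$ nodes
become inter-dependent, as shown in Figure 24.2(b). Nevertheless, we can easily compute the
full conditionals as follows:


$$p(z_i = k|\mathbf{z}_{-i}, \mathbf{x}, \alpha, \boldsymbol{\beta}) \propto p(z_i = k|\mathbf{z}_{-i}, \alpha)\, p(\mathbf{x}|z_i = k, \mathbf{z}_{-i}, \boldsymbol{\beta}) \tag{24.21}$$

$$\propto p(z_i = k|\mathbf{z}_{-i}, \alpha)\, p(\mathbf{x}_i|\mathbf{x}_{-i}, z_i = k, \mathbf{z}_{-i}, \boldsymbol{\beta})\, p(\mathbf{x}_{-i}|z_i = k, \mathbf{z}_{-i}, \boldsymbol{\beta}) \tag{24.22}$$

$$\propto p(z_i = k|\mathbf{z}_{-i}, \alpha)\, p(\mathbf{x}_i|\mathbf{x}_{-i}, z_i = k, \mathbf{z}_{-i}, \boldsymbol{\beta}) \tag{24.23}$$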


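where $\boldsymbol{\beta} = (\mathbf{m}_0, \mathbf{V}_0, \mathbf{S}_0, \nu_0)$ are the hyper-parameters for the class-conditional densities. The
first term can be obtained by integrating out $\boldsymbol{\pi}$. Suppose we use a symmetric prior of the form
$\boldsymbol{\pi} \sim \text{Dir}(\boldsymbol{\alpha})$, where $\alpha_k = \alpha/K$. From Equation 5.26 we have


$$p(z_1, \ldots, z_N|\alpha) = \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} \prod_{k=1}^K \frac{\Gamma(N_k + \alpha/K)}{\Gamma(\alpha/K)} \tag{24.24}$$

Hence


$$p(z_i = k|\mathbf{z}_{-i}, \alpha) = \frac{p(\mathbf{z}_{1:N}|\alpha)}{p(\mathbf{z}_{-i}|\alpha)} = \frac{\frac{1}{\Gamma(N + \alpha)}}{\frac{1}{\Gamma(N + \alpha - 1)}} \times \frac{\Gamma(N_k + \alpha/K)}{\Gamma(N_{k,-i} + \alpha/K)} \tag{24.25}$$

$$= \frac{\Gamma(N + \alpha - 1)}{\Gamma(N + \alpha)} \frac{\Gamma(N_{k,-i} + 1 + \alpha/K)}{\Gamma(N_{k,-i} + \alpha/K)} = \frac{N_{k,-i} + \alpha/K}{N + \alpha - 1} \tag{24.26}$$


where $N_{k,-i} \triangleq \sum_{n \neq i} \mathbb{I}(z_n = k) = N_k - 1$, and where we exploited the fact that $\Gamma(x + 1) =
x\Gamma(x)$.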
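
As a sanity check on the simplification in Equation 24.26, the Python sketch below evaluates the ratio of Gamma functions in log space and compares it with the closed form. The function names and the example counts are illustrative, not taken from the text.

```python
import math


def prior_term_gamma(N, Nk_minus_i, alpha, K):
    """Ratio of Gamma functions from the first equality of Eq. 24.26, via lgamma."""
    log_ratio = (math.lgamma(N + alpha - 1) - math.lgamma(N + alpha)
                 + math.lgamma(Nk_minus_i + 1 + alpha / K)
                 - math.lgamma(Nk_minus_i + alpha / K))
    return math.exp(log_ratio)


def prior_term_closed_form(N, Nk_minus_i, alpha, K):
    """Closed form (N_{k,-i} + alpha/K) / (N + alpha - 1)."""
    return (Nk_minus_i + alpha / K) / (N + alpha - 1)
```

For example, with $N = 10$, $K = 3$, $\alpha = 1.5$ and counts $N_{k,-i} = (3, 4, 2)$ over the other nine points, the two functions agree for every $k$, and the resulting probabilities sum to one.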

To obtain the second term in Equation 24.23, which is the posterior predictive distribution for
$\mathbf{x}_i$ given all the other data and all the assignments, we use the fact that


$$p(\mathbf{x}_i|\mathbf{x}_{-i}, \mathbf{z}_{-i}, z_i = k, \boldsymbol{\beta}) = p(\mathbf{x}_i|\mathcal{D}_{-i,k}) \tag{24.27}$$


where $\mathcal{D}_{-i,k} = \{\mathbf{x}_j : z_j = k, j \neq i\}$ is all the data assigned to cluster $k$ except for $\mathbf{x}_i$. If we
use a conjugate prior for $\boldsymbol{\theta}_k$, we can compute $p(\mathbf{x}_i|\mathcal{D}_{-i,k})$ in closed form. Furthermore, we can
efficiently update these predictive likelihoods by caching the sufficient statistics for each cluster.
To compute the above expression, we remove $\mathbf{x}_i$'s statistics from its current cluster (namely $z_i$),
and then evaluate $\mathbf{x}_i$ under each cluster's posterior predictive. Once we have picked a new
cluster, we add $\mathbf{x}_i$'s statistics to this new cluster.

Some pseudo-code for one step of the algorithm is shown in Algorithm 24.1, based on (Sudderth
2006, p94). (We update the nodes in random order to improve the mixing time, as
suggested in (Roberts and Sahu 1997).) We can initialize the sample by sequentially sampling
from $p(z_i|\mathbf{z}_{1:i-1}, \mathbf{x}_{1:i})$. (See fmGibbs for some Matlab code, by Yee-Whye Teh.) In the case of
GMMs, both the naive sampler and collapsed sampler take $O(NKD)$ time per step.


Algorithm 24.1: Collapsed Gibbs sampler for a mixture model

for each i = 1 : N in random order do
    Remove x_i's sufficient statistics from old cluster z_i;
    for each k = 1 : K do
        Compute p_k(x_i) = p(x_i | {x_j : z_j = k, j ≠ i});
        Compute p(z_i = k | z_{-i}, D) ∝ (N_{k,-i} + α/K) p_k(x_i);
    Sample z_i ~ p(z_i | ·);
    Add x_i's sufficient statistics to new cluster z_i
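
Algorithm 24.1 can be sketched in Python as follows. This is a simplified illustration, not the fmGibbs code: it assumes one-dimensional data, a known variance `sigma2` and a $\mathcal{N}(m_0, v_0)$ prior on each cluster mean, so the cached sufficient statistics are just a count and a sum per cluster and the posterior predictive is Gaussian. For brevity it uses a random initialization instead of the sequential one described above. All names and default values are illustrative.

```python
import math
import random


def log_predictive(x, n, s, sigma2, m0, v0):
    """log p(x | cluster data), where (n, s) are the cluster's count and sum."""
    prec = 1.0 / v0 + n / sigma2
    mean = (m0 / v0 + s / sigma2) / prec
    var = 1.0 / prec + sigma2  # posterior variance of the mean plus noise variance
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def collapsed_gibbs(x, K, n_sweeps, alpha=1.0, sigma2=1.0, m0=0.0, v0=100.0, seed=0):
    """Run n_sweeps of collapsed Gibbs; returns the final assignment vector z."""
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in x]
    counts, sums = [0] * K, [0.0] * K
    for xi, zi in zip(x, z):
        counts[zi] += 1
        sums[zi] += xi
    order = list(range(len(x)))
    for _ in range(n_sweeps):
        rng.shuffle(order)  # visit the nodes in random order
        for i in order:
            # Remove x_i's sufficient statistics from its old cluster.
            counts[z[i]] -= 1
            sums[z[i]] -= x[i]
            # Unnormalized log p(z_i = k | z_{-i}, D) for each cluster k.
            logw = [math.log(counts[k] + alpha / K)
                    + log_predictive(x[i], counts[k], sums[k], sigma2, m0, v0)
                    for k in range(K)]
            top = max(logw)
            w = [math.exp(lw - top) for lw in logw]
            z[i] = rng.choices(range(K), weights=w)[0]
            # Add x_i's sufficient statistics to its new cluster.
            counts[z[i]] += 1
            sums[z[i]] += x[i]
    return z
```

The normalizer $N + \alpha - 1$ from Equation 24.26 is omitted because it is the same for every $k$.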


A comparison of this method with the standard Gibbs sampler is shown in Figure 24.3. The 
vertical axis is the data log probability at each iteration, computed using 


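$$\log p(\mathcal{D}|\mathbf{z}, \boldsymbol{\theta}) = \sum_{i=1}^N \log\left[\pi_{z_i}\, p(\mathbf{x}_i|\boldsymbol{\theta}_{z_i})\right] \tag{24.28}$$

To compute this quantity using the collapsed sampler, we have to sample $\boldsymbol{\theta} = (\boldsymbol{\pi}, \boldsymbol{\theta}_{1:K})$ given
the data and the current assignment $\mathbf{z}$.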
In Figure 24.3 we see that the collapsed sampler does indeed generally work better than the
vanilla sampler. Occasionally, however, both methods can get stuck in poor local modes. (Note
that the error bars in Figure 24.3(b) are averaged over starting values, whereas the theorem refers
to MC samples in a single run.)


Figure 24.3 Comparison of collapsed (red) and vanilla (blue) Gibbs sampling for a mixture of K = 4 two-
dimensional Gaussians applied to N = 300 data points (shown in Figure 25.7). We plot log probability of
the data vs iteration. (a) 20 different random initializations. (b) logprob averaged over 100 different random
initializations. Solid line is the median, thick dashed lines are the 0.25 and 0.75 quantiles, and thin dashed
lines are the 0.05 and 0.95 quantiles. Source: Figure 2.20 of (Sudderth 2006). Used with kind permission of Erik
Sudderth.


Figure 24.4 (a) Least squares regression lines for math scores vs socio-economic status for 100 schools.
Population mean (pooled estimate) is in bold. (b) Plot of $w_{2j}$ (the slope) vs $N_j$ (sample size) for the 100
schools. The extreme slopes tend to correspond to schools with smaller sample sizes. (c) Predictions from
the hierarchical model. Population mean is in bold. Based on Figure 11.1 of (Hoff 2009). Figure generated
by multilevelLinregDemo, written by Emtiyaz Khan.


Gibbs sampling for hierarchical GLMs 


Often we have data from multiple related sources. If some sources are more reliable and/or 
data-rich than others, it makes sense to model all the data simultaneously, so as to enable the 
borrowing of statistical strength. One of the most natural ways to solve such problems is to use
hierarchical Bayesian modeling, also called multi-level modeling. In Section 9.6, we discussed 
a way to perform approximate inference in such models using variational methods. Here we 
discuss how to use Gibbs sampling. 

To explain the method, consider the following example. Suppose we have data on students
in different schools. Such data is naturally modeled in a two-level hierarchy: we let $y_{ij}$ be the
response variable we want to predict for student $i$ in school $j$. This prediction can be based on
school and student specific covariates, $\mathbf{x}_{ij}$. Since the quality of schools varies, we want to use
a separate parameter for each school. So our model becomes


$$y_{ij} = \mathbf{x}_{ij}^T \mathbf{w}_j + \epsilon_{ij} \tag{24.29}$$


We will illustrate this model below, using a dataset from (Hoff 2009, p197), where $x_{ij}$ is the
socio-economic status (SES) of student $i$ in school $j$, and $y_{ij}$ is their math score.


Figure 24.5 Multi-level model for linear regression.

We could fit each $\mathbf{w}_j$ separately, but this can give poor results if the sample size of a given
school is small. This is illustrated in Figure 24.4(a), which plots the least squares regression 
line estimated separately for each of the J = 100 schools. We see that most of the slopes are 
positive, but there are a few “errant” cases where the slope is negative. It turns out that the lines 
with extreme slopes tend to be in schools with small sample size, as shown in Figure 24.4(b). 
Thus we may not necessarily trust these fits. 

We can get better results if we construct a hierarchical Bayesian model, in which the $\mathbf{w}_j$ are
assumed to come from a common prior: $\mathbf{w}_j \sim \mathcal{N}(\boldsymbol{\mu}_w, \boldsymbol{\Sigma}_w)$. This is illustrated in Figure 24.5. In
this model, the schools with small sample size borrow statistical strength from the schools with
larger sample size, because the $\mathbf{w}_j$'s are correlated via the latent common parents $(\boldsymbol{\mu}_w, \boldsymbol{\Sigma}_w)$. (It
is crucial that these hyper-parameters be inferred from data; if they were fixed constants, the
$\mathbf{w}_j$ would be conditionally independent, and there would be no information sharing between
them.)

To complete the model specification, we must specify priors for the shared parameters. Following
(Hoff 2009, p198), we will use the following semi-conjugate forms, for convenience:


$$\boldsymbol{\mu}_w \sim \mathcal{N}(\boldsymbol{\mu}_0, \mathbf{V}_0) \tag{24.30}$$

$$\boldsymbol{\Sigma}_w \sim \text{IW}(\eta_0, \mathbf{S}_0^{-1}) \tag{24.31}$$

$$\sigma^2 \sim \text{IG}(\nu_0/2, \nu_0 \sigma_0^2/2) \tag{24.32}$$


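Given this, it is simple to show that the full conditionals needed for Gibbs sampling have the
following forms. For the group-specific weights:


$$p(\mathbf{w}_j|\mathcal{D}_j, \boldsymbol{\theta}) = \mathcal{N}(\mathbf{w}_j|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \tag{24.33}$$

$$\boldsymbol{\Sigma}_j^{-1} = \boldsymbol{\Sigma}_w^{-1} + \mathbf{X}_j^T \mathbf{X}_j/\sigma^2 \tag{24.34}$$

$$\boldsymbol{\mu}_j = \boldsymbol{\Sigma}_j\left(\boldsymbol{\Sigma}_w^{-1} \boldsymbol{\mu}_w + \mathbf{X}_j^T \mathbf{y}_j/\sigma^2\right) \tag{24.35}$$


For the overall mean:


$$p(\boldsymbol{\mu}_w|\mathbf{w}_{1:J}, \boldsymbol{\Sigma}_w) = \mathcal{N}(\boldsymbol{\mu}_w|\boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N) \tag{24.36}$$

$$\boldsymbol{\Sigma}_N^{-1} = \mathbf{V}_0^{-1} + J\boldsymbol{\Sigma}_w^{-1} \tag{24.37}$$

$$\boldsymbol{\mu}_N = \boldsymbol{\Sigma}_N\left(\mathbf{V}_0^{-1} \boldsymbol{\mu}_0 + J\boldsymbol{\Sigma}_w^{-1} \bar{\mathbf{w}}\right) \tag{24.38}$$


where $\bar{\mathbf{w}} = \frac{1}{J} \sum_j \mathbf{w}_j$. For the overall covariance:


$$p(\boldsymbol{\Sigma}_w|\boldsymbol{\mu}_w, \mathbf{w}_{1:J}) = \text{IW}\left((\mathbf{S}_0 + \mathbf{S}_\mu)^{-1}, \eta_0 + J\right) \tag{24.39}$$

$$\mathbf{S}_\mu = \sum_j (\mathbf{w}_j - \boldsymbol{\mu}_w)(\mathbf{w}_j - \boldsymbol{\mu}_w)^T \tag{24.40}$$


For the noise variance:


$$p(\sigma^2|\mathcal{D}, \mathbf{w}_{1:J}) = \text{IG}\left([\nu_0 + N]/2, [\nu_0 \sigma_0^2 + \text{SSR}(\mathbf{w}_{1:J})]/2\right) \tag{24.41}$$

$$\text{SSR}(\mathbf{w}_{1:J}) = \sum_{j=1}^J \sum_{i=1}^{N_j} (y_{ij} - \mathbf{w}_j^T \mathbf{x}_{ij})^2 \tag{24.42}$$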
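
To make the borrowing of strength concrete, here is a Python sketch of a much reduced version of this sampler. It is an illustration under strong assumptions, not the multilevelLinregDemo code: each group has only a scalar intercept $w_j$ (that is, $x_{ij} = 1$), and the noise variance `sigma2` and the between-group variance `tau2` are held fixed, so only the scalar analogues of Equations 24.33 to 24.38 are sampled. All names and default values are illustrative.

```python
import math
import random


def gibbs_hierarchical(groups, n_iter, sigma2=1.0, tau2=1.0, mu0=0.0, v0=100.0, seed=0):
    """Gibbs sampler for y_ij = w_j + noise, w_j ~ N(mu, tau2), mu ~ N(mu0, v0).

    `groups` is a list of lists of observations. Returns the Monte Carlo
    estimate of E[w_j | D] for each group.
    """
    rng = random.Random(seed)
    J = len(groups)
    mu = mu0
    w = [sum(g) / len(g) for g in groups]  # start at the per-group sample means
    w_total = [0.0] * J
    for _ in range(n_iter):
        # Group-specific weights, given the shared mean mu.
        for j, g in enumerate(groups):
            prec = 1.0 / tau2 + len(g) / sigma2
            mean = (mu / tau2 + sum(g) / sigma2) / prec
            w[j] = rng.gauss(mean, math.sqrt(1.0 / prec))
        # Shared mean, given all the group weights.
        prec = 1.0 / v0 + J / tau2
        mean = (mu0 / v0 + sum(w) / tau2) / prec
        mu = rng.gauss(mean, math.sqrt(1.0 / prec))
        for j in range(J):
            w_total[j] += w[j]
    return [t / n_iter for t in w_total]
```

With three large groups centered at 0 and one group containing the single observation 5, the estimate for the small group is pulled well below 5 towards the other groups, whereas the estimates for the large groups barely move.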


Applying Gibbs sampling to our hierarchical model, we get the results shown in Figure 24.4(c). 
The light gray lines plot the mean of the posterior predictive distribution for each school: 


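$$\mathbb{E}[y_{ij}|\mathbf{x}_{ij}] = \mathbf{x}_{ij}^T \hat{\mathbf{w}}_j \tag{24.43}$$

where

$$\hat{\mathbf{w}}_j = \mathbb{E}[\mathbf{w}_j|\mathcal{D}] \approx \frac{1}{S} \sum_{s=1}^S \mathbf{w}_j^{(s)} \tag{24.44}$$


The dark gray line in the middle plots the prediction using the overall mean parameters, $\mathbf{x}_{ij}^T \hat{\boldsymbol{\mu}}_w$.
We see that the method has regularized the fits quite nicely, without enforcing too much
uniformity. (The amount of shrinkage is controlled by $\boldsymbol{\Sigma}_w$, which in turn depends on the
hyper-parameters; in this example, we used vague values.)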


BUGS and JAGS 


One reason Gibbs sampling is so popular is that it is possible to design general purpose software
that will work for almost any model. This software just needs a model specification, usually
in the form of a directed graphical model (specified in a file, or created with a graphical user
interface), and a library of methods for sampling from different kinds of full conditionals. (This
can often be done using adaptive rejection sampling, described in Section 23.3.4.) An example
of such a package is BUGS (Lunn et al. 2000), which stands for “Bayesian updating using Gibbs
Sampling”. BUGS is very widely used in biostatistics and social science. Another more recent,
but very similar, package is JAGS (Plummer 2003), which stands for “Just Another Gibbs Sampler”.
This uses a similar model specification language to BUGS.

For example, we can describe the model in Figure 24.5 as follows: 


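model {
  # Likelihood: Gaussian response with a school-specific linear predictor
  for (i in 1:N) {
    for (j in 1:J) {
      y[i,j] ~ dnorm(y.hat[i,j], tau.y)
      y.hat[i,j] <- inprod(W[j,], X[i,j,])
    }
  }
  # Noise precision, derived from a uniform prior on the standard deviation
  tau.y <- pow(sigma.y, -2)
  sigma.y ~ dunif(0,100)

  # School-level weights drawn from a shared Gaussian prior
  for (j in 1:J) {
    W[j,] ~ dmnorm(mu, SigmaInv)
  }
  # Priors on the shared precision matrix and mean
  SigmaInv ~ dwish(S0[,], eta0)
  mu ~ dmnorm(mu0, V0inv)
}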


We can then just pass this model to BUGS or JAGS, which will generate samples for us. See 
the webpages for details. 

Although this approach is appealing, unfortunately it can be much slower than using hand- 
written code, especially for complex models. There has been some work on automatically 
deriving model-specific optimized inference code (Fischer and Schumann 2003), but fast code 
still typically requires human expertise. 


The Imputation Posterior (IP) algorithm 


The Imputation Posterior or IP algorithm (Tanner and Wong 1987) is a special case of Gibbs
sampling in which we group the variables into two classes: hidden variables $\mathbf{z}$ and parameters
$\boldsymbol{\theta}$. This should sound familiar: it is basically an MCMC version of EM, where the E step gets
replaced by the I step, and the M step gets replaced by the P step. This is an example of a more
general strategy called data augmentation, whereby we introduce auxiliary variables in order
to simplify the posterior computations (here the computation of $p(\boldsymbol{\theta}|\mathcal{D})$). See (Tanner 1996; van
Dyk and Meng 2001) for more information.
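
The alternation between the I step and the P step can be sketched in Python on a deliberately small problem. The example below is a hypothetical illustration, not taken from (Tanner and Wong 1987): a two-component mixture whose components $\mathcal{N}(-2, 1)$ and $\mathcal{N}(2, 1)$ are assumed known, with the only unknown parameter being the weight $\pi$ of the second component, which is given a uniform Beta(1, 1) prior.

```python
import math
import random


def ip_mixing_weight(x, n_iter, mu=(-2.0, 2.0), seed=0):
    """IP (data augmentation) sampler for the weight of the second component."""
    rng = random.Random(seed)
    pi = 0.5
    draws = []
    for _ in range(n_iter):
        # I step: impute the hidden indicators given the current parameter.
        n1 = 0
        for xi in x:
            a = pi * math.exp(-0.5 * (xi - mu[1]) ** 2)
            b = (1.0 - pi) * math.exp(-0.5 * (xi - mu[0]) ** 2)
            if rng.random() < a / (a + b):
                n1 += 1
        # P step: sample the parameter from its posterior given the imputed data.
        pi = rng.betavariate(1 + n1, 1 + len(x) - n1)
        draws.append(pi)
    return draws
```

After discarding some initial draws, the average of the returned values approximates the posterior mean of $\pi$.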


Blocking Gibbs sampling 


Gibbs sampling can be quite slow, since it only updates one variable at a time (so-called single
site updating). If the variables are highly correlated, it will take a long time to move away
from the current state. This is illustrated in Figure 24.6, where we illustrate sampling from a 2d
Gaussian (see Exercise 24.1 for the details). If the variables are highly correlated, the algorithm
will move very slowly through the state space. In particular, the size of the moves is controlled
by the variance of the conditional distributions. If this is $\ell$ in the $x_1$ direction, and the support
of the distribution is $L$ along this dimension, then we need $O((L/\ell)^2)$ steps to obtain an
independent sample.


Figure 24.6 Illustration of potentially slow sampling when using Gibbs sampling for a skewed 2D Gaussian.
Based on Figure 11.11 of (Bishop 2006b). Figure generated by gibbsGaussDemo.

In some cases we can efficiently sample groups of variables at a time. This is called blocking 
Gibbs sampling or blocked Gibbs sampling (Jensen et al. 1995; Wilkinson and Yeung 2002), 
and can make much bigger moves through the state space. 
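
The effect of correlation on single site updating can be seen in a small experiment. The Python sketch below is an illustration that does not reproduce gibbsGaussDemo; it assumes a zero-mean bivariate Gaussian with unit variances and correlation `rho`, for which the full conditionals are $x_1|x_2 \sim \mathcal{N}(\rho x_2, 1 - \rho^2)$ and symmetrically for $x_2$. The resulting chain for $x_1$ is a first-order autoregression with coefficient $\rho^2$, so successive samples become nearly identical as $\rho$ approaches 1.

```python
import math
import random


def gibbs_bivariate_gaussian(rho, n_iter, seed=0):
    """Single-site Gibbs sampling; returns the trace of x1."""
    rng = random.Random(seed)
    x1 = x2 = 0.0
    sd = math.sqrt(1.0 - rho ** 2)  # standard deviation of each full conditional
    trace = []
    for _ in range(n_iter):
        x1 = rng.gauss(rho * x2, sd)
        x2 = rng.gauss(rho * x1, sd)
        trace.append(x1)
    return trace


def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    m = sum(xs) / len(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den
```

For `rho = 0.99` the lag-1 autocorrelation is close to 0.98, whereas for `rho = 0.1` it is close to zero. A blocked sampler that draws $(x_1, x_2)$ jointly would produce independent samples in this example.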


Metropolis Hastings algorithm 


Although Gibbs sampling is simple, it is somewhat restricted in the set of models to which it 
can be applied. For example, it is not much help in computing p(w|D) for a logistic regression 
model, since the corresponding graphical model has no useful Markov structure. In addition, 
Gibbs sampling can be quite slow, as we mentioned above. 

Fortunately, there is a more general algorithm that can be used, known as the Metropolis 
Hastings or MH algorithm, which we describe below. 


Basic idea 


The basic idea in MH is that at each step, we propose to move from the current state $\mathbf{x}$ to a
new state $\mathbf{x}'$ with probability $q(\mathbf{x}'|\mathbf{x})$, where $q$ is called the proposal distribution (also called
the kernel). The user is free to use any kind of proposal they want, subject to some conditions
which we explain below. This makes MH quite a flexible method. A commonly used proposal is
a symmetric Gaussian distribution centered on the current state, $q(\mathbf{x}'|\mathbf{x}) = \mathcal{N}(\mathbf{x}'|\mathbf{x}, \boldsymbol{\Sigma})$; this is
called a random walk Metropolis algorithm. We discuss how to choose $\boldsymbol{\Sigma}$ in Section 24.3.3. If
we use a proposal of the form $q(\mathbf{x}'|\mathbf{x}) = q(\mathbf{x}')$, where the new state is independent of the old
state, we get a method known as the independence sampler, which is similar to importance
sampling (Section 23.4).

Having proposed a move to $\mathbf{x}'$, we then decide whether to accept this proposal or not
according to some formula, which ensures that the fraction of time spent in each state is
proportional to $p^*(\mathbf{x})$. If the proposal is accepted, the new state is $\mathbf{x}'$, otherwise the new state
is the same as the current state, $\mathbf{x}$ (i.e., we repeat the sample).

If the proposal is symmetric, so $q(\mathbf{x}'|\mathbf{x}) = q(\mathbf{x}|\mathbf{x}')$, the acceptance probability is given by the
following formula:


$$r = \min\left(1, \frac{p^*(\mathbf{x}')}{p^*(\mathbf{x})}\right) \tag{24.45}$$


We see that if $\mathbf{x}'$ is more probable than $\mathbf{x}$, we definitely move there (since $\frac{p^*(\mathbf{x}')}{p^*(\mathbf{x})} > 1$), but if
$\mathbf{x}'$ is less probable, we may still move there anyway, depending on the relative probabilities. So
instead of greedily moving to only more probable states, we occasionally allow “downhill” moves
to less probable states. In Section 24.3.6, we prove that this procedure ensures that the fraction
of time we spend in each state $\mathbf{x}$ is proportional to $p^*(\mathbf{x})$.

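If the proposal is asymmetric, so $q(\mathbf{x}'|\mathbf{x}) \neq q(\mathbf{x}|\mathbf{x}')$, we need the Hastings correction, given
by the following:


$$r = \min(1, \alpha) \tag{24.46}$$

$$\alpha = \frac{p^*(\mathbf{x}')\, q(\mathbf{x}|\mathbf{x}')}{p^*(\mathbf{x})\, q(\mathbf{x}'|\mathbf{x})} = \frac{p^*(\mathbf{x}')/q(\mathbf{x}'|\mathbf{x})}{p^*(\mathbf{x})/q(\mathbf{x}|\mathbf{x}')} \tag{24.47}$$


This correction is needed to compensate for the fact that the proposal distribution itself (rather
than just the target distribution) might favor certain states.

An important reason why MH is a useful algorithm is that, when evaluating $\alpha$, we only need to
know the target density up to a normalization constant. In particular, suppose $p^*(\mathbf{x}) = \frac{1}{Z}\tilde{p}(\mathbf{x})$,
where $\tilde{p}(\mathbf{x})$ is an unnormalized distribution and $Z$ is the normalization constant. Then


$$\alpha = \frac{(\tilde{p}(\mathbf{x}')/Z)\, q(\mathbf{x}|\mathbf{x}')}{(\tilde{p}(\mathbf{x})/Z)\, q(\mathbf{x}'|\mathbf{x})} \tag{24.48}$$


so the $Z$'s cancel. Hence we can sample from $p^*$ even if $Z$ is unknown. In particular, all we
have to do is evaluate $\tilde{p}$ pointwise, where $\tilde{p}(\mathbf{x}) = p^*(\mathbf{x})Z$.

The overall algorithm is summarized in Algorithm 24.2.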


Gibbs sampling is a special case of MH 


It turns out that Gibbs sampling, which we discussed in Section 24.2, is a special case of MH. In
particular, it is equivalent to using MH with a sequence of proposals of the form


$$q(\mathbf{x}'|\mathbf{x}) = p(x_i'|\mathbf{x}_{-i})\, \mathbb{I}(\mathbf{x}_{-i}' = \mathbf{x}_{-i}) \tag{24.49}$$


That is, we move to a new state where $x_i$ is sampled from its full conditional, but $\mathbf{x}_{-i}$ is left
unchanged.

We now prove that the acceptance rate of each such proposal is 1, so the overall algorithm
also has an acceptance rate of 100%. We have


$$\alpha = \frac{p(\mathbf{x}')\, q(\mathbf{x}|\mathbf{x}')}{p(\mathbf{x})\, q(\mathbf{x}'|\mathbf{x})} = \frac{p(x_i'|\mathbf{x}_{-i}')\, p(\mathbf{x}_{-i}')\, p(x_i|\mathbf{x}_{-i}')}{p(x_i|\mathbf{x}_{-i})\, p(\mathbf{x}_{-i})\, p(x_i'|\mathbf{x}_{-i})} \tag{24.50}$$

$$= \frac{p(x_i'|\mathbf{x}_{-i})\, p(\mathbf{x}_{-i})\, p(x_i|\mathbf{x}_{-i})}{p(x_i|\mathbf{x}_{-i})\, p(\mathbf{x}_{-i})\, p(x_i'|\mathbf{x}_{-i})} = 1 \tag{24.51}$$


where we exploited the fact that $\mathbf{x}_{-i}' = \mathbf{x}_{-i}$ and that $q(\mathbf{x}'|\mathbf{x}) = p(x_i'|\mathbf{x}_{-i})$.

The fact that the acceptance rate is 100% does not necessarily mean that Gibbs will converge
rapidly, since it only updates one coordinate at a time (see Section 24.2.8). Fortunately, there are
many other kinds of proposals we can use, as we discuss below.


Algorithm 24.2: Metropolis Hastings algorithm

Initialize x^0;
for s = 0, 1, 2, ... do
    Define x = x^s;
    Sample x' ~ q(x'|x);
    Compute acceptance probability α = p̃(x') q(x|x') / (p̃(x) q(x'|x));
    Compute r = min(1, α);
    Sample u ~ U(0, 1);
    Set new sample to x^{s+1} = x' if u < r, and x^{s+1} = x^s if u ≥ r
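
Algorithm 24.2 can be sketched in Python as follows. This is a minimal illustration for a scalar state, working with log densities for numerical stability. The function names and the calling convention are assumptions made for this sketch: `log_p_tilde(x)` returns the unnormalized log target, `sample_q(x, rng)` draws $x' \sim q(\cdot|x)$, and `log_q(a, b)` returns $\log q(a|b)$ up to an additive constant. The helper functions define a hypothetical asymmetric proposal, a random walk with a drift of 0.5, so that the Hastings correction is actually exercised.

```python
import math
import random


def metropolis_hastings(log_p_tilde, sample_q, log_q, x0, n_samples, seed=0):
    """Return (samples, acceptance rate) for a scalar Metropolis Hastings chain."""
    rng = random.Random(seed)
    x = x0
    samples = []
    n_accept = 0
    for _ in range(n_samples):
        x_new = sample_q(x, rng)
        # log of alpha = p~(x') q(x|x') / (p~(x) q(x'|x))
        log_alpha = (log_p_tilde(x_new) + log_q(x, x_new)
                     - log_p_tilde(x) - log_q(x_new, x))
        r = math.exp(min(0.0, log_alpha))
        if rng.random() < r:
            x = x_new
            n_accept += 1
        samples.append(x)  # on rejection the current state is repeated
    return samples, n_accept / n_samples


def drift_sample_q(x, rng):
    """Hypothetical asymmetric proposal: x' ~ N(x + 0.5, 1)."""
    return rng.gauss(x + 0.5, 1.0)


def drift_log_q(a, b):
    """log q(a | b) for the drifted proposal, up to a constant."""
    return -0.5 * (a - b - 0.5) ** 2
```

For instance, with `log_p_tilde = lambda x: -0.5 * (x - 3.0) ** 2` the samples should have mean close to 3 despite the drift in the proposal, because the correction term compensates for it.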


Proposal distributions 


For a given target distribution p*, a proposal distribution q is valid or admissible if it gives 
a non-zero probability of moving to the states that have non-zero probability in the target. 
Formally, we can write this as 


$$\text{supp}(p^*) \subseteq \cup_{\mathbf{x}}\, \text{supp}(q(\cdot|\mathbf{x})) \tag{24.52}$$


For example, a Gaussian random walk proposal has non-zero probability density on the entire 
state space, and hence is a valid proposal for any continuous state space. 

Of course, in practice, it is important that the proposal spread its probability mass in just the 
right way. Figure 24.7 shows an example where we use MH to sample from a mixture of two 
1D Gaussians using a random walk proposal, q(x'|x) = N (x'|x, v). This is a somewhat tricky 
target distribution, since it consists of two well separated modes. It is very important to set the 
variance of the proposal v correctly: If the variance is too low, the chain will only explore one 
of the modes, as shown in Figure 24.7(a), but if the variance is too large, most of the moves 
will be rejected, and the chain will be very sticky, i.e., it will stay in the same state for a long 
time. This is evident from the long stretches of repeated values in Figure 24.7(b). If we set 
the proposal’s variance just right, we get the trace in Figure 24.7(c), where the samples clearly 
explore the support of the target distribution. We discuss how to tune the proposal below. 
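
The trade-off between step size and acceptance rate can be explored with a short experiment. The Python sketch below is an illustration that does not attempt to reproduce mcmcGmmDemo. It assumes a target with component means of -20 and 20, weights 0.3 and 0.7, and variances of 100, and it treats the hypothetical parameter `step` as the standard deviation of the random walk proposal.

```python
import math
import random


def log_target(x):
    """Unnormalized log density of 0.3 N(-20, 100) + 0.7 N(20, 100)."""
    la = math.log(0.3) - 0.5 * (x + 20.0) ** 2 / 100.0
    lb = math.log(0.7) - 0.5 * (x - 20.0) ** 2 / 100.0
    top = max(la, lb)  # log-sum-exp, to avoid underflow far from both modes
    return top + math.log(math.exp(la - top) + math.exp(lb - top))


def random_walk_mh(step, n_samples, x0=20.0, seed=0):
    """Random walk Metropolis; returns (samples, acceptance rate)."""
    rng = random.Random(seed)
    x = x0
    accepted = 0
    samples = []
    for _ in range(n_samples):
        x_new = x + rng.gauss(0.0, step)
        # Symmetric proposal, so the acceptance ratio is the target ratio.
        if rng.random() < math.exp(min(0.0, log_target(x_new) - log_target(x))):
            x = x_new
            accepted += 1
        samples.append(x)
    return samples, accepted / n_samples
```

Running this with `step` set to 1, 8 and 500 gives acceptance rates that fall as the step grows: small steps are almost always accepted but move slowly, whereas very large steps are mostly rejected.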

One big advantage of Gibbs sampling is that one does not need to choose the proposal
distribution, and furthermore, the acceptance rate is 100%. Of course, a 100% acceptance can
trivially be achieved by using a proposal with variance 0 (assuming we start at a mode), but this
is obviously not exploring the posterior. So having a high acceptance is not the ultimate goal.
We can increase the amount of exploration by increasing the variance of the Gaussian kernel.
Often one experiments with different parameters until the acceptance rate is between 25% and
40%, which theory suggests is optimal, at least for Gaussian target distributions. These short
initial runs, used to tune the proposal, are called pilot runs.


Figure 24.7 An example of the Metropolis Hastings algorithm for sampling from a mixture of two 1D
Gaussians ($\mu = (-20, 20)$, $\pi = (0.3, 0.7)$, $\Sigma = (100, 100)$), using a Gaussian proposal with variances of
$v \in \{1, 500, 8\}$. (a) When $v = 1$, the chain gets trapped near the starting state and fails to sample from
the mode at $\mu = -20$. (b) When $v = 500$, the chain is very “sticky”, so its effective sample size is low (as
reflected by the rough histogram approximation at the end). (c) Using a variance of $v = 8$ is just right and
leads to a good approximation of the true distribution (shown in red). Figure generated by mcmcGmmDemo.
Based on code by Christophe Andrieu and Nando de Freitas.
[Panel titles: (a) MH with N(0, 1.000^2) proposal; (b) MH with N(0, 500.000^2) proposal; (c) MH with N(0, 8.000^2) proposal. Axes: Iterations, Samples.]


Figure 24.8 (a) Joint posterior of the parameters for 1d logistic regression when applied to some SAT data.
(b) Marginal for the offset $w_0$. (c) Marginal for the slope $w_1$. We see that the marginals do not capture the
fact that the parameters are highly correlated. Figure generated by logregSatMhDemo.


Gaussian proposals 


If we have a continuous state space, the Hessian $\mathbf{H}$ at a local mode $\hat{\mathbf{w}}$ can be used to define
the covariance of a Gaussian proposal distribution. This approach has the advantage that the
Hessian models the local curvature and length scales of each dimension; this approach therefore
avoids some of the slow mixing behavior of Gibbs sampling shown in Figure 24.6.

There are two obvious approaches: (1) an independence proposal, $q(\mathbf{w}'|\mathbf{w}) = \mathcal{N}(\mathbf{w}'|\hat{\mathbf{w}}, \mathbf{H}^{-1})$
or (2), a random walk proposal, $q(\mathbf{w}'|\mathbf{w}) = \mathcal{N}(\mathbf{w}'|\mathbf{w}, s^2\mathbf{H}^{-1})$, where $s^2$ is a scale factor chosen
to facilitate rapid mixing. (Roberts and Rosenthal 2001) prove that, if the posterior is Gaussian,
the asymptotically optimal value is to use $s^2 = 2.38^2/D$, where $D$ is the dimensionality of $\mathbf{w}$;
this results in an acceptance rate of 0.234.

For example, consider MH for binary logistic regression. From Equation 8.7, we have that
the Hessian of the negative log-likelihood is $\mathbf{H}_l = \mathbf{X}^T\mathbf{D}\mathbf{X}$, where $\mathbf{D} = \text{diag}(\mu_i(1 - \mu_i))$ and $\mu_i =
\text{sigm}(\hat{\mathbf{w}}^T\mathbf{x}_i)$. If we assume a Gaussian prior, $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \mathbf{V}_0)$, we have $\mathbf{H} = \mathbf{V}_0^{-1} + \mathbf{H}_l$, so
the asymptotically optimal Gaussian proposal has the form


$$q(\mathbf{w}'|\mathbf{w}) = \mathcal{N}\left(\mathbf{w}, s^2\left(\mathbf{V}_0^{-1} + \mathbf{X}^T\mathbf{D}\mathbf{X}\right)^{-1}\right) \tag{24.53}$$

See (Gamerman 1997; Rossi et al. 2006; Fruhwirth-Schnatter and Fruhwirth 2010) for further
details. The approach is illustrated in Figure 24.8, where we sample parameters from a 1d
logistic regression model fit to some SAT data. We initialize the chain at the mode, computed
using IRLS, and then use the above random walk Metropolis sampler.

If you cannot afford to compute the mode or its Hessian $\mathbf{X}^T\mathbf{D}\mathbf{X}$, an alternative approach,
suggested in (Scott 2009), is to approximate the above proposal as follows:


$$q(\mathbf{w}'|\mathbf{w}) = \mathcal{N}\left(\mathbf{w}, \left(\mathbf{V}_0^{-1} + \frac{6}{\pi^2}\mathbf{X}^T\mathbf{X}\right)^{-1}\right) \tag{24.54}$$


Mixture proposals


If one doesn’t know what kind of proposal to use, one can try a mixture proposal, which is a 
convex combination of base proposals: 


$$q(\mathbf{x}'|\mathbf{x}) = \sum_{k=1}^K w_k\, q_k(\mathbf{x}'|\mathbf{x}) \tag{24.55}$$


where $w_k$ are the mixing weights. As long as each $q_k$ is individually valid, the overall proposal
will also be valid.
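
A mixture proposal is straightforward to implement when every base proposal is a symmetric random walk, because the mixture is then symmetric too and no Hastings correction is needed. The Python sketch below is an illustration under that assumption. The target, a mixture $0.3\,\mathcal{N}(-5, 1) + 0.7\,\mathcal{N}(5, 1)$, and the two step sizes are hypothetical choices: the small step explores within a mode and the large step jumps between modes.

```python
import math
import random


def log_target(x):
    """Unnormalized log density of 0.3 N(-5, 1) + 0.7 N(5, 1)."""
    la = math.log(0.3) - 0.5 * (x + 5.0) ** 2
    lb = math.log(0.7) - 0.5 * (x - 5.0) ** 2
    top = max(la, lb)
    return top + math.log(math.exp(la - top) + math.exp(lb - top))


def mixture_proposal_mh(n_samples, steps=(0.5, 10.0), weights=(0.5, 0.5),
                        x0=5.0, seed=0):
    """MH with a mixture of Gaussian random walk proposals."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        # Pick a base proposal according to the mixing weights, then propose.
        step = rng.choices(steps, weights=weights)[0]
        x_new = x + rng.gauss(0.0, step)
        if rng.random() < math.exp(min(0.0, log_target(x_new) - log_target(x))):
            x = x_new
        samples.append(x)
    return samples
```

With these settings the fraction of samples that are positive should be close to 0.7, the weight of the right-hand mode.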


Data-driven MCMC 


The most efficient proposals depend not just on the previous hidden state, but also the visible 
data, i.e., they have the form q(x'|x, D). This is called data-driven MCMC (see e.g., (Tu and 
Zhu 2002)). To create such proposals, one can sample (x,D) pairs from the forwards model 
and then train a discriminative classifier to predict p(x|f(D)), where f(D) are some features 
extracted from the visible data. 

Typically $\mathbf{x}$ is a high-dimensional vector (e.g., position and orientation of all the limbs of a
person in a visual object detector), so it is hard to predict the entire state vector, $p(\mathbf{x}|f(\mathcal{D}))$.
Instead we might train a discriminative detector to predict parts of the state-space, $p(x_k|f_k(\mathcal{D}))$,
such as the location of just the face of a person. We can then use a proposal of the form


$$q(\mathbf{x}'|\mathbf{x}, \mathcal{D}) = \pi_0\, q_0(\mathbf{x}'|\mathbf{x}) + \sum_k \pi_k\, q_k(x_k'|f_k(\mathcal{D})) \tag{24.56}$$


where $q_0$ is a standard data-independent proposal (e.g., random walk), and $q_k$ updates the $k$'th
component of the state space. For added efficiency, the discriminative proposals should suggest
joint changes to multiple variables, but this is often hard to do.

The overall procedure is a form of generate and test: the discriminative proposals $q(\mathbf{x}'|\mathbf{x})$
generate new hypotheses, which are then “tested” by computing the posterior ratio $\frac{p(\mathbf{x}'|\mathcal{D})}{p(\mathbf{x}|\mathcal{D})}$, to
see if the new hypothesis is better or worse. By adding an annealing step, one can modify
the algorithm to find posterior modes; this is called simulated annealing, and is described in
Section 24.6.1. One advantage of using the mode-seeking version of the algorithm is that we do
not need to ensure the proposal distribution is reversible.


Adaptive MCMC 


One can change the parameters of the proposal as the algorithm is running to increase efficiency. 
This is called adaptive MCMC. This allows one to start with a broad covariance (say), allowing 
large moves through the space until a mode is found, followed by a narrowing of the covariance 
to ensure careful exploration of the region around the mode. 

However, one must be careful not to violate the Markov property; thus the parameters of the
proposal should not depend on the entire history of the chain. It turns out that a sufficient
condition to ensure this is that the adaptation is “faded out” gradually over time. See e.g., (Andrieu
and Thoms 2008) for details.


Initialization and mode hopping


It is necessary to start MCMC in an initial state that has non-zero probability. If the model has 
deterministic constraints, finding such a legal configuration may be a hard problem in itself. It 
is therefore common to initialize MCMC methods at a local mode, found using an optimizer. 

In some domains (especially with discrete state spaces), it is a more effective use of computa- 
tion time to perform multiple restarts of an optimizer, and to average over these modes, rather 
than exploring similar points around a local mode. However, in continuous state spaces, the 
mode contains negligible volume (Section 5.2.1.3), so it is necessary to locally explore around 
each mode, in order to visit enough posterior probability mass. 


Why MH works * 


To prove that the MH procedure generates samples from p*, we have to use a bit of Markov 
chain theory, so be sure to read Section 17.2.3 first. 
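The MH algorithm defines a Markov chain with the following transition matrix:


$$p(\mathbf{x}'|\mathbf{x}) = \begin{cases} q(\mathbf{x}'|\mathbf{x})\, r(\mathbf{x}'|\mathbf{x}) & \text{if } \mathbf{x}' \neq \mathbf{x} \\ q(\mathbf{x}|\mathbf{x}) + \sum_{\mathbf{x}' \neq \mathbf{x}} q(\mathbf{x}'|\mathbf{x})(1 - r(\mathbf{x}'|\mathbf{x})) & \text{otherwise} \end{cases} \tag{24.57}$$


This follows from a case analysis: if you move to $\mathbf{x}'$ from $\mathbf{x}$, you must have proposed it (with
probability $q(\mathbf{x}'|\mathbf{x})$) and it must have been accepted (with probability $r(\mathbf{x}'|\mathbf{x})$); otherwise you
stay in state $\mathbf{x}$, either because that is what you proposed (with probability $q(\mathbf{x}|\mathbf{x})$), or because
you proposed something else (with probability $q(\mathbf{x}'|\mathbf{x})$) but it was rejected (with probability
$1 - r(\mathbf{x}'|\mathbf{x})$).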

Let us analyse this Markov chain. Recall from Section 17.2.3.4 that a chain satisfies detailed 
balance if 


$$p(\mathbf{x}'|\mathbf{x})\, p^*(\mathbf{x}) = p(\mathbf{x}|\mathbf{x}')\, p^*(\mathbf{x}') \tag{24.58}$$


We also showed that if a chain satisfies detailed balance, then p* is its stationary distribution. 
Our goal is to show that the MH algorithm defines a transition function that satisfies detailed 
balance and hence that p* is its stationary distribution. (If Equation 24.58 holds, we say that p* 
is an invariant distribution wrt the Markov transition kernel q.) 


Theorem 24.3.1. If the transition matrix defined by the MH algorithm (given by Equation 24.57) is 
ergodic and irreducible, then p* is its unique limiting distribution. 


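Proof. Consider two states $\mathbf{x}$ and $\mathbf{x}'$. Either

$$p^*(\mathbf{x})\, q(\mathbf{x}'|\mathbf{x}) < p^*(\mathbf{x}')\, q(\mathbf{x}|\mathbf{x}') \tag{24.59}$$

or

$$p^*(\mathbf{x})\, q(\mathbf{x}'|\mathbf{x}) > p^*(\mathbf{x}')\, q(\mathbf{x}|\mathbf{x}') \tag{24.60}$$

We will ignore ties (which occur with probability zero for continuous distributions). Without loss
of generality, assume that $p^*(\mathbf{x})\, q(\mathbf{x}'|\mathbf{x}) > p^*(\mathbf{x}')\, q(\mathbf{x}|\mathbf{x}')$. Hence

$$\alpha(\mathbf{x}'|\mathbf{x}) = \frac{p^*(\mathbf{x}')\, q(\mathbf{x}|\mathbf{x}')}{p^*(\mathbf{x})\, q(\mathbf{x}'|\mathbf{x})} < 1 \tag{24.61}$$

Hence we have $r(\mathbf{x}'|\mathbf{x}) = \alpha(\mathbf{x}'|\mathbf{x})$ and $r(\mathbf{x}|\mathbf{x}') = 1$.

Now to move from $\mathbf{x}$ to $\mathbf{x}'$ we must first propose $\mathbf{x}'$ and then accept it. Hence

$$p(\mathbf{x}'|\mathbf{x}) = q(\mathbf{x}'|\mathbf{x})\, r(\mathbf{x}'|\mathbf{x}) = q(\mathbf{x}'|\mathbf{x}) \frac{p^*(\mathbf{x}')\, q(\mathbf{x}|\mathbf{x}')}{p^*(\mathbf{x})\, q(\mathbf{x}'|\mathbf{x})} = \frac{p^*(\mathbf{x}')}{p^*(\mathbf{x})}\, q(\mathbf{x}|\mathbf{x}') \tag{24.62}$$

Hence

$$p^*(\mathbf{x})\, p(\mathbf{x}'|\mathbf{x}) = p^*(\mathbf{x}')\, q(\mathbf{x}|\mathbf{x}') \tag{24.63}$$

The backwards probability is

$$p(\mathbf{x}|\mathbf{x}') = q(\mathbf{x}|\mathbf{x}')\, r(\mathbf{x}|\mathbf{x}') = q(\mathbf{x}|\mathbf{x}') \tag{24.64}$$

since $r(\mathbf{x}|\mathbf{x}') = 1$. Inserting this into Equation 24.63 we get

$$p^*(\mathbf{x})\, p(\mathbf{x}'|\mathbf{x}) = p^*(\mathbf{x}')\, p(\mathbf{x}|\mathbf{x}') \tag{24.65}$$

so detailed balance holds wrt $p^*$. Hence, from Theorem 17.2.3, $p^*$ is a stationary distribution.
Furthermore, from Theorem 17.2.2, this distribution is unique, since the chain is ergodic and
irreducible.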
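
The argument can be checked numerically on a small discrete state space. The Python sketch below builds the MH transition matrix of Equation 24.57 from an unnormalized target `p_tilde` and a proposal matrix `q`, where `q[x][xp]` is the probability of proposing state `xp` from state `x`. The function name and the example inputs are illustrative.

```python
def mh_transition_matrix(p_tilde, q):
    """Build the MH transition matrix T, with T[x][xp] = p(xp | x)."""
    n = len(p_tilde)
    T = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for xp in range(n):
            if xp != x and q[x][xp] > 0:
                alpha = (p_tilde[xp] * q[xp][x]) / (p_tilde[x] * q[x][xp])
                T[x][xp] = q[x][xp] * min(1.0, alpha)  # propose, then accept
        # Staying put covers both self-proposals and rejected proposals.
        T[x][x] = 1.0 - sum(T[x])
    return T
```

For any target with positive entries and any proposal matrix whose rows sum to one, the resulting matrix satisfies `p_tilde[x] * T[x][xp] == p_tilde[xp] * T[xp][x]` up to rounding error, even when `q` is asymmetric.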


Reversible jump (trans-dimensional) MCMC * 


Suppose we have a set of models with different numbers of parameters, e.g., mixture models in
which the number of mixture components is unknown. Let the model be denoted by $m$, and
let its unknowns (e.g., parameters) be denoted by $\mathbf{x}_m \in \mathcal{X}_m$ (e.g., $\mathcal{X}_m = \mathbb{R}^{n_m}$, where $n_m$ is
the dimensionality of model $m$). Sampling in spaces of differing dimensionality is called
trans-dimensional MCMC (Green 2003). We could sample the model indicator $m \in \{1, \ldots, M\}$ and
sample all the parameters from the product space $\prod_{m=1}^M \mathcal{X}_m$, but this is very inefficient. It is
more parsimonious to sample in the union space $\mathcal{X} = \cup_{m=1}^M \{m\} \times \mathcal{X}_m$, where we only worry
about parameters for the currently active model.

The difficulty with this approach arises when we move between models of different dimen- 
sionality. The trouble is that when we compute the MH acceptance ratio, we are comparing 
densities defined in different dimensionality spaces, which is meaningless. It is like trying to 
compare a sphere with a circle. The solution, proposed by (Green 1998) and known as reversible 
jump MCMC or RJMCMC, is to augment the low dimensional space with extra random variables 
so that the two spaces have a common measure. 

Unfortunately, we do not have space to go into details here. Suffice it to say that the method 
can be made to work in theory, although it is a bit tricky in practice. If, however, the continuous 
parameters can be integrated out (resulting in a method called collapsed RJMCMC), much of the
difficulty goes away, since we are just left with a discrete state space, where there is no need
to worry about change of measure. For example, (Denison et al. 2002) includes many examples
of applications of collapsed RJMCMC applied to Bayesian inference for adaptive basis-function
models. They sample basis functions from a fixed set of candidates (e.g., centered on the data 
points), and integrate out the other parameters analytically. This provides a Bayesian alternative 
to using RVMs or SVMs.

[Figure 24.9 panels: two columns of plots of p(x) against the states 0 to 20. Left column: Initial Condition X_0 = 10. Right column: Initial Condition X_0 = 17.]


Figure 24.9 Illustration of convergence to the uniform distribution over {0, 1, . . . , 20} using a symmetric 
random walk starting from (left) state 10, and (right) state 17. Based on Figures 29.14 and 29.15 of (MacKay 
2003). Figure generated by randomWalk0to20Demo. 


Speed and accuracy of MCMC 


In this section, we discuss a number of important theoretical and practical issues to do with 
MCMC. 


The burn-in phase 


We start MCMC from an arbitrary initial state. As we explained in Section 17.2.3, only when the 
chain has “forgotten” where it started from will the samples be coming from the chain’s stationary 
distribution. Samples collected before the chain has reached its stationary distribution do not 
come from p*, and are usually thrown away. The initial period, whose samples will be ignored, 
is called the burn-in phase. 

For example, consider a uniform distribution on the integers {0,1,...,20}. Suppose we 
sample from this using a symmetric random walk. In Figure 24.9, we show two runs of the 
algorithm. On the left, we start in state 10; on the right, we start in state 17. Even in this small 
problem it takes over 100 steps until the chain has “forgotten” where it started from. 
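
The behaviour in this example can be reproduced exactly by propagating the distribution over states, rather than by simulating sample paths. The sketch below makes one assumption that is not spelled out above: the walk proposes x − 1 or x + 1 with probability 1/2 each, and a proposal that falls outside {0, ..., 20} is rejected, so the chain stays where it is. The function names are illustrative.

```python
def step(dist):
    """Push a distribution over {0, ..., n-1} through one step of the walk."""
    n = len(dist)
    new = [0.0] * n
    for x, mass in enumerate(dist):
        for y in (x - 1, x + 1):
            if 0 <= y < n:
                new[y] += 0.5 * mass
            else:
                new[x] += 0.5 * mass   # proposal rejected: stay put
    return new

def tv_to_uniform(dist):
    """Total variation distance between dist and the uniform distribution."""
    n = len(dist)
    return 0.5 * sum(abs(p - 1.0 / n) for p in dist)

def tv_after(start, t, n=21):
    """Distance to uniform after t steps, starting with all mass on `start`."""
    dist = [0.0] * n
    dist[start] = 1.0
    for _ in range(t):
        dist = step(dist)
    return tv_to_uniform(dist)

for t in (10, 100, 400, 1000):
    print(t, round(tv_after(10, t), 4), round(tv_after(17, t), 4))
```

Printing the distance for both starting states shows how the off-centre start at 17 takes longer to be forgotten than the central start at 10.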

It is difficult to diagnose when the chain has burned in, an issue we discuss in more detail 
below. (This is one of the fundamental weaknesses of MCMC.) As an interesting example of what 
can happen if you start collecting samples too early, consider the Potts model. Figure 24.10(a)
shows a sample after 500 iterations of Gibbs sampling. This suggests that the model likes


Figure 24.10 Illustration of problems caused by poor mixing. (a) One sample from a 5-state Potts model
on a 128 x 128 grid with 8 nearest neighbor connectivity and J = 2/3 (as in (Geman and Geman 1984)),
after 200 iterations. (b) One sample from the same model after 10,000 iterations. Used with kind permission
of Erik Sudderth.


medium-sized regions where the label is the same, implying the model would make a good 
prior for image segmentation. Indeed, this was suggested in the original Gibbs sampling paper 
(Geman and Geman 1984). 

However, it turns out that if you run the chain long enough, you get isolated speckles, as 
in Figure 24.10(b). The results depend on the coupling strength, but in general, it is very hard 
to find a setting which produces nice medium-sized blobs: most parameters result in a few 
super-clusters, or lots of small fragments. In fact, there is a rapid phase transition between these 
two regimes. This led to a paper called “The Ising/Potts model is not well suited to segmentation 
tasks” (Morris et al. 1996). It is possible to create priors more suited to image segmentation 
(e.g., (Sudderth and Jordan 2008)), but the main point here is that sampling before reaching 
convergence can lead to erroneous conclusions. 


Mixing rates of Markov chains * 


The amount of time it takes for a Markov chain to converge to the stationary distribution, and
forget its initial state, is called the mixing time. More formally, we say that the mixing time
from state x_0 is the minimal time such that, for any constant ε > 0, we have that

$$\tau_\epsilon(x_0) \triangleq \min\{t : \|\delta_{x_0}(x)\,T^t - p^*\|_1 \le \epsilon\} \tag{24.66}$$

where δ_{x_0}(x) is a distribution with all its mass in state x_0, T is the transition matrix of the
chain (which depends on the target p* and the proposal q), and δ_{x_0}(x) T^t is the distribution
after t steps. The mixing time of the chain is defined as

$$\tau_\epsilon \triangleq \max_{x_0} \tau_\epsilon(x_0) \tag{24.67}$$

The mixing time is determined by the eigengap γ = λ_1 − λ_2, which is the difference of the

Figure 24.11 A Markov chain with low conductance. The dotted arcs represent transitions with very low
probability. Source: Figure 12.6 of (Koller and Friedman 2009). Used with kind permission of Daphne 
Koller. 


first and second eigenvalues of the transition matrix. In particular, one can show that 


$$\tau_\epsilon \le O\!\left(\frac{1}{\gamma}\log\frac{n}{\epsilon}\right) \tag{24.68}$$

where n is the number of states. Since computing the transition matrix can be hard to do, 
especially for high dimensional and/or continuous state spaces, it is useful to find other ways to 
estimate the mixing time. 

An alternative approach is to examine the geometry of the state space. For example, consider 
the chain in Figure 24.11. We see that the state space consists of two “islands”, each of which 
is connected via a narrow “bottleneck”. (If they were completely disconnected, the chain would 
not be ergodic, and there would no longer be a unique stationary distribution.) We define the 
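conductance φ of a chain as the minimum probability, over all subsets of states, of transitioning
from that set to its complement:

$$\phi \triangleq \min_{S:\,0 \le p^*(S) \le 0.5} \frac{\sum_{x \in S,\, x' \in S^c} T(x \to x')}{p^*(S)} \tag{24.69}$$

One can show that

$$\tau_\epsilon \le O\!\left(\frac{1}{\phi^2}\log\frac{n}{\epsilon}\right) \tag{24.70}$$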


Hence chains with low conductance have high mixing time. For example, distributions with 
well-separated modes usually have high mixing time. Simple MCMC methods often do not work 
well in such cases, and more advanced algorithms, such as parallel tempering, are necessary 
(see e.g., (Liu 2001)).
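
For a chain with only a handful of states, both the mixing time of Equations 24.66 and 24.67 and the conductance of Equation 24.69 can be computed by brute force. The sketch below is illustrative only: the six-state "two island" chain, its parameters a and b, and the function names are assumptions made for the example (the chain is symmetric, so its stationary distribution is uniform), and empty sets are skipped when minimizing over S.

```python
from itertools import combinations

def two_island_chain(b, a=0.3):
    """Islands {0,1,2} and {3,4,5}; within an island every move has
    probability a, and the only bridge is 2 <-> 3 with probability b."""
    n = 6
    T = [[0.0] * n for _ in range(n)]
    for island in ((0, 1, 2), (3, 4, 5)):
        for i in island:
            for j in island:
                if i != j:
                    T[i][j] = a
    T[2][3] = T[3][2] = b
    for i in range(n):
        T[i][i] = 1.0 - sum(T[i])      # self-transition takes the rest
    return T

def mixing_time(T, p_star, eps, max_t=200000):
    """Smallest t with ||delta_{x0} T^t - p*||_1 <= eps, maximized over x0."""
    n = len(T)
    worst = 0
    for x0 in range(n):
        dist = [1.0 if x == x0 else 0.0 for x in range(n)]
        t = 0
        while sum(abs(d - p) for d, p in zip(dist, p_star)) > eps and t < max_t:
            dist = [sum(dist[x] * T[x][y] for x in range(n)) for y in range(n)]
            t += 1
        worst = max(worst, t)
    return worst

def conductance(T, p_star):
    """Minimum over nonempty S with p*(S) <= 0.5 of flow(S -> S^c) / p*(S)."""
    n = len(T)
    best = float("inf")
    for k in range(1, n):
        for S in combinations(range(n), k):
            mass = sum(p_star[x] for x in S)
            if mass <= 0.5 + 1e-12:
                inside = set(S)
                flow = sum(T[x][y] for x in S for y in range(n) if y not in inside)
                best = min(best, flow / mass)
    return best

uniform = [1.0 / 6] * 6
for b in (0.2, 0.01):
    T = two_island_chain(b)
    print(b, conductance(T, uniform), mixing_time(T, uniform, eps=0.01))
```

Narrowing the bridge (small b) lowers the conductance and lengthens the mixing time, in line with Equation 24.70.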


Practical convergence diagnostics 


Computing the mixing time of a chain is in general quite difficult, since the transition matrix
is usually very hard to compute. In practice various heuristics have been proposed to diagnose
convergence; see (Geyer 1992; Cowles and Carlin 1996; Brooks and Roberts 1998) for a review.
Strictly speaking, these methods do not diagnose convergence, but rather non-convergence. That
is, the method may claim the chain has converged when in fact it has not. This is a flaw common
to all convergence diagnostics, since diagnosing convergence is computationally intractable in
general (Bhatnagar et al. 2010).

One of the simplest approaches to assessing when the method has converged is to run 
multiple chains from very different overdispersed starting points, and to plot the samples of 
some variables of interest. This is called a trace plot. If the chain has mixed, it should have 
“forgotten” where it started from, so the trace plots should converge to the same distribution, 
and thus overlap with each other. 

Figure 24.12 gives an example. We show the traceplot for x which was sampled from a
mixture of two 1D Gaussians using four different methods: MH with a symmetric Gaussian
proposal of variance σ² ∈ {1, 8, 500}, and Gibbs sampling. We see that σ² = 1 has not mixed,
which is also evident from Figure 24.7(a), which shows that a single chain never leaves the area
where it started. The results for the other methods indicate that the chains rapidly converge to
the stationary distribution, no matter where they started. (The sticky nature of the σ² = 500
proposal is very evident. This reduces the computational efficiency, as we discuss below, but
not the statistical validity.)
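
The raw material for such trace plots is easy to generate. The following is a minimal sketch, not the mcmcGmmDemo code: the mixture weights, means and standard deviation, the starting points and the function names are all illustrative assumptions. It runs random walk MH from several overdispersed starting points and reports the mean and acceptance rate of each chain; the stored samples are what one would plot against iteration.

```python
import math
import random

def log_target(x):
    """Log of an unnormalized mixture of two 1d Gaussians (assumed values:
    weights 0.3 and 0.7, means -20 and 20, standard deviation 10)."""
    a = 0.3 * math.exp(-0.5 * ((x + 20.0) / 10.0) ** 2)
    b = 0.7 * math.exp(-0.5 * ((x - 20.0) / 10.0) ** 2)
    return math.log(a + b + 1e-300)   # guard against log(0) far in the tails

def mh_chain(x0, sigma2, n_steps, rng):
    """Random walk MH with proposal N(x'|x, sigma2).
    Returns the list of samples and the fraction of accepted proposals."""
    sigma = math.sqrt(sigma2)
    x, lp = x0, log_target(x0)
    samples, n_accept = [], 0
    for _ in range(n_steps):
        x_new = x + rng.gauss(0.0, sigma)
        lp_new = log_target(x_new)
        # symmetric proposal, so r = min(1, p*(x') / p*(x))
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            x, lp = x_new, lp_new
            n_accept += 1
        samples.append(x)
    return samples, n_accept / n_steps

rng = random.Random(0)
starts = [-50.0, -10.0, 10.0, 50.0]          # overdispersed starting points
for sigma2 in (1.0, 8.0, 500.0):
    runs = [mh_chain(x0, sigma2, 1000, rng) for x0 in starts]
    means = [round(sum(s) / len(s), 1) for s, _ in runs]
    rates = [round(acc, 2) for _, acc in runs]
    print("sigma2 =", sigma2, "chain means:", means, "acceptance:", rates)
```

Chains whose means disagree strongly across starting points have not yet forgotten where they started, which is what non-overlapping trace plots show visually.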


Estimated potential scale reduction (EPSR) 


We can assess convergence more quantitatively as follows. The basic idea is to compare the
variance of a quantity within each chain to its variance across chains. More precisely, suppose
we collect S samples (after burn-in) from each of C chains of D variables, x_{isc}, i = 1 : D,
s = 1 : S, c = 1 : C. Let y_{sc} be a scalar quantity of interest derived from x_{1:D,s,c} (e.g.,
y_{sc} = x_{isc} for some chosen i). Define the within-sequence mean and overall mean as

$$\bar{y}_{\cdot c} \triangleq \frac{1}{S}\sum_{s=1}^{S} y_{sc}, \qquad \bar{y}_{\cdot\cdot} \triangleq \frac{1}{C}\sum_{c=1}^{C} \bar{y}_{\cdot c} \tag{24.71}$$

Define the between-sequence and within-sequence variance as

$$B \triangleq \frac{S}{C-1}\sum_{c=1}^{C} (\bar{y}_{\cdot c} - \bar{y}_{\cdot\cdot})^2, \qquad W \triangleq \frac{1}{C}\sum_{c=1}^{C}\left[\frac{1}{S-1}\sum_{s=1}^{S} (y_{sc} - \bar{y}_{\cdot c})^2\right] \tag{24.72}$$


We can now construct two estimates of the variance of y. The first estimate is W: this should 
underestimate var [y] if the chains have not ranged over the full posterior. The second estimate 
is 

$$\hat{V} = \frac{S-1}{S}\,W + \frac{1}{S}\,B \tag{24.73}$$


This is an estimate of var [y] that is unbiased under stationarity, but is an overestimate if the 
starting points were overdispersed (Gelman and Rubin 1992). From this, we can define the 
following convergence diagnostic statistic, known as the estimated potential scale reduction 
or EPSR: 


$$\hat{R} \triangleq \sqrt{\frac{\hat{V}}{W}} \tag{24.74}$$

Figure 24.12 Traceplots for MCMC samplers. Each color represents the samples from a different starting
point. (a-c) MH with proposal N(x'|x, σ²) for σ² ∈ {1, 8, 500}, corresponding to Figure 24.7. (d) Gibbs
sampling. Figure generated by mcmcGmmDemo.


This quantity, which was first proposed in (Gelman and Rubin 1992), measures the degree to
which the posterior variance would decrease if we were to continue sampling in the S → ∞
limit. If R̂ ≈ 1 for any given quantity, then that estimate is reliable (or at least is not
unreliable). The R̂ values for the four samplers in Figure 24.12 are 1.493, 1.039, 1.005 and 1.007.
So this diagnostic has correctly identified that the sampler using the first (σ² = 1) proposal is
untrustworthy.
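
Equations 24.71 to 24.74 translate directly into code. The sketch below is a plain transcription under the assumption that the chains are stored as C lists of equal length S holding the post burn-in values y_{sc}; the function name and the synthetic chains used to exercise it are illustrative, and are not the samplers of Figure 24.12.

```python
import math
import random

def epsr(chains):
    """Estimated potential scale reduction for a list of C chains,
    each a list of S scalar values y_{sc} collected after burn-in."""
    C, S = len(chains), len(chains[0])
    chain_means = [sum(ch) / S for ch in chains]            # within-sequence means
    overall = sum(chain_means) / C                          # overall mean
    B = S / (C - 1) * sum((m - overall) ** 2 for m in chain_means)
    W = sum(sum((y - m) ** 2 for y in ch) / (S - 1)
            for ch, m in zip(chains, chain_means)) / C
    V_hat = (S - 1) / S * W + B / S
    return math.sqrt(V_hat / W)

rng = random.Random(0)
# four chains that all sample the same distribution
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]
# four chains that are each stuck near a different value
stuck = [[rng.gauss(mu, 1.0) for _ in range(2000)] for mu in (-3.0, -1.0, 1.0, 3.0)]
print("mixed:", epsr(mixed), "stuck:", epsr(stuck))
```

When the chains disagree about the mean, B dominates and the statistic moves well above 1; when they agree, it is close to 1.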


Accuracy of MCMC 


The samples produced by MCMC are auto-correlated, and this reduces their information content
relative to independent or “perfect” samples. We can quantify this as follows.⁴ Suppose we want


4. This section is based on (Hoff 2009, Sec 6.6).

Figure 24.13 Autocorrelation functions corresponding to Figure 24.12. Figure generated by mcmcGmmDemo.


to estimate the mean of f(X), for some function f, where X ∼ p(·). Denote the true mean by

$$f^* \triangleq \mathbb{E}[f(X)] \tag{24.75}$$


A Monte Carlo estimate is given by 


$$\bar{f} = \frac{1}{S}\sum_{s=1}^{S} f_s \tag{24.76}$$

where f_s = f(x_s) and x_s ∼ p(x). An MCMC estimate of the variance of this estimate is given by


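$$\begin{aligned}
\mathrm{Var}_{MCMC}[\bar{f}] &= \mathbb{E}\left[(\bar{f} - f^*)^2\right] && (24.77) \\
&= \mathbb{E}\left[\left\{\frac{1}{S}\sum_{s=1}^{S}(f_s - f^*)\right\}^2\right] && (24.78) \\
&= \frac{1}{S^2}\,\mathbb{E}\left[\sum_{s=1}^{S}(f_s - f^*)^2\right] + \frac{1}{S^2}\sum_{s \ne t}\mathbb{E}\left[(f_s - f^*)(f_t - f^*)\right] && (24.79) \\
&= \mathrm{Var}_{MC}(\bar{f}) + \frac{1}{S^2}\sum_{s \ne t}\mathbb{E}\left[(f_s - f^*)(f_t - f^*)\right] && (24.80)
\end{aligned}$$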


where the first term is the Monte Carlo estimate of the variance if the samples weren't correlated, 
and the second term depends on the correlation of the samples. We can measure this as follows. 
Define the sample-based auto-correlation at lag t of a set of samples f_1, ..., f_S as follows:

$$\rho_t \triangleq \frac{\frac{1}{S-t}\sum_{s=1}^{S-t}(f_s - \bar{f})(f_{s+t} - \bar{f})}{\frac{1}{S-1}\sum_{s=1}^{S}(f_s - \bar{f})^2} \tag{24.81}$$


This is called the autocorrelation function (ACF). This is plotted in Figure 24.13 for our four 
samplers for the Gaussian mixture model. We see that the ACF of the Gibbs sampler (bottom 
right) dies off to 0 much more rapidly than the MH samplers, indicating that each Gibbs sample 
is “worth” more than each MH sample. 

A simple method to reduce the autocorrelation is to use thinning, in which we keep every 
n'th sample. This does not increase the efficiency of the underlying sampler, but it does save 
space, since it avoids storing highly correlated samples. 

We can estimate the information content of a set of samples by computing the effective
sample size (ESS) S_eff, defined by

$$S_{\mathrm{eff}} \triangleq \frac{\mathrm{Var}_{MC}(f)}{\mathrm{Var}_{MCMC}(\bar{f})} \tag{24.82}$$


From Figure 24.12, it is clear that the effective sample size of the Gibbs sampler is higher than 
that of the other samplers (in this example). 
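
The sample autocorrelation of Equation 24.81 and an effective sample size can be estimated from a single chain as follows. This is a sketch under stated assumptions: the ESS estimator used here, S / (1 + 2 Σ_t ρ_t) with the sum truncated at the first non-positive ρ_t, is a common practical choice rather than something taken from the text, and the AR(1) series is a hypothetical stand-in for real MCMC output.

```python
import random

def acf(f, t):
    """Sample autocorrelation at lag t, in the form of Equation 24.81."""
    S = len(f)
    mean = sum(f) / S
    num = sum((f[s] - mean) * (f[s + t] - mean) for s in range(S - t)) / (S - t)
    den = sum((v - mean) ** 2 for v in f) / (S - 1)
    return num / den

def effective_sample_size(f, max_lag=1000):
    """Assumed estimator: S / (1 + 2 * sum of rho_t), truncating the sum
    at the first lag whose autocorrelation is not positive."""
    S = len(f)
    total = 0.0
    for t in range(1, min(max_lag, S - 1)):
        rho = acf(f, t)
        if rho <= 0.0:
            break
        total += rho
    return S / (1.0 + 2.0 * total)

def ar1_chain(phi, n, rng):
    """Correlated series x_t = phi * x_{t-1} + noise, used as a stand-in."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

rng = random.Random(0)
chain = ar1_chain(0.9, 5000, rng)
thinned = chain[::10]                      # thinning: keep every 10th sample
print("lag-1 ACF, full chain:", acf(chain, 1))
print("lag-1 ACF, thinned   :", acf(thinned, 1))
print("ESS of full chain    :", effective_sample_size(chain))
```

Thinning lowers the autocorrelation of the stored samples, but, as noted above, it does not make the underlying sampler any more efficient.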


How many chains? 


A natural question to ask is: how many chains should we run? We could either run one long 
chain to ensure convergence, and then collect samples spaced far apart, or we could run many 
short chains, but that wastes the burn-in time. In practice it is common to run a medium
number of chains (say 3) of medium length (say 100,000 steps), and to take samples from each 
after discarding the first half of the samples. If we initialize at a local mode, we may be able to 
use all the samples, and not wait for burn-in.

Model    Goal   Method                   Reference
Probit   MAP    Gradient                 Section 9.4.1
Probit   MAP    EM                       Section 11.4.6
Probit   Post   EP                       (Nickisch and Rasmussen 2008)
Probit   Post   Gibbs+                   Exercise 24.6
Probit   Post   Gibbs with ARS           (Dellaportas and Smith 1993)
Probit   Post   MH using IRLS proposal   (Gamerman 1997)
Logit    MAP    Gradient                 Section 8.3.4
Logit    Post   Gibbs+ with Student      (Fruhwirth-Schnatter and Fruhwirth 2010)
Logit    Post   Gibbs+ with KS           (Holmes and Held 2006)


Table 24.1 Summary of some possible algorithms for estimation and inference for binary classification 
problems using Gaussian priors. Abbreviations: Aux. = auxiliary variable sampling, ARS = adaptive rejection 
sampling, EP = expectation propagation, Gibbs+ = Gibbs sampling with auxiliary variables, IRLS = iterative 
reweighted least squares, KS = Kolmogorov Smirnov, MAP = maximum a posteriori, MH = Metropolis 
Hastings, Post = posterior. 


Auxiliary variable MCMC * 


Sometimes we can dramatically improve the efficiency of sampling by introducing dummy 
auxiliary variables, in order to reduce correlation between the original variables. If the original 
variables are denoted by x, and the auxiliary variables by z, we require that Σ_z p(x, z) = p(x),
and that p(x,z) is easier to sample from than just p(x). If we meet these two conditions, 
we can sample in the enlarged model, and then throw away the sampled z values, thereby 
recovering samples from p(x). We give some examples below. 


Auxiliary variable sampling for logistic regression 


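In Section 9.4.2, we discussed the latent variable interpretation of probit regression. Recall that
this had the form

$$\begin{aligned}
z_i &\triangleq w^T x_i + \epsilon_i && (24.83) \\
\epsilon_i &\sim \mathcal{N}(0, 1) && (24.84) \\
y_i = 1 &= \mathbb{I}(z_i \ge 0) && (24.85)
\end{aligned}$$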


We exploited this representation in Section 11.4.6, where we used EM to find an ML estimate. It
is straightforward to convert this into an auxiliary variable Gibbs sampler (Exercise 24.6), since
p(w|z, D) is Gaussian and p(z_i|x_i, y_i, w) is truncated Gaussian, both of which are easy to sample
from.
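
To make the two conditional updates concrete, here is a minimal sketch of such a Gibbs sampler for the simplest possible case: a single scalar weight w with no bias term and a N(0, prior_var) prior. It is not a solution to Exercise 24.6; the scalar model, the prior variance, the inverse-cdf truncated Gaussian sampler and all function names are assumptions made for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def sample_truncated_normal(mean, positive, rng):
    """Draw z ~ N(mean, 1) restricted to z >= 0 (positive=True) or z < 0,
    using the inverse-cdf method on the standardized variable z - mean."""
    cut = STD_NORMAL.cdf(-mean)                  # P(z < 0)
    if positive:
        u = cut + rng.random() * (1.0 - cut)
    else:
        u = rng.random() * cut
    u = min(max(u, 1e-12), 1.0 - 1e-12)          # keep inv_cdf inside (0, 1)
    return mean + STD_NORMAL.inv_cdf(u)

def probit_gibbs(xs, ys, n_iter, rng, prior_var=10.0):
    """Alternate z | w, D (truncated Gaussians) and w | z, D (Gaussian)."""
    w, draws = 0.0, []
    # posterior precision of w given z: prior precision + sum of x_i^2
    precision = 1.0 / prior_var + sum(x * x for x in xs)
    for _ in range(n_iter):
        zs = [sample_truncated_normal(w * x, y == 1, rng) for x, y in zip(xs, ys)]
        mean = sum(x * z for x, z in zip(xs, zs)) / precision
        w = rng.gauss(mean, precision ** -0.5)
        draws.append(w)
    return draws

rng = random.Random(0)
xs = [rng.uniform(-2.0, 2.0) for _ in range(200)]
ys = [1 if 1.5 * x + rng.gauss(0.0, 1.0) >= 0 else 0 for x in xs]   # w_true = 1.5
draws = probit_gibbs(xs, ys, 300, rng)
kept = draws[100:]                                # discard burn-in
print("posterior mean of w:", sum(kept) / len(kept))
```

With a vector of weights, the update for w becomes a multivariate Gaussian draw, but the structure of the sampler is unchanged.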
Now let us discuss how to derive an auxiliary variable Gibbs sampler for logistic regression.
Let ε_i follow a logistic distribution, with pdf

$$p_{\mathrm{Logistic}}(\epsilon) = \frac{e^{-\epsilon}}{(1 + e^{-\epsilon})^2} \tag{24.86}$$

with mean E[ε] = 0 and variance var[ε] = π²/3. The cdf has the form F(ε) = sigm(ε), which
is the logistic function. Since y_i = 1 iff w^T x_i + ε > 0, we have, by symmetry, that

$$p(y_i = 1|x_i, w) = \int_{-w^T x_i}^{\infty} f(\epsilon)\,d\epsilon = \int_{-\infty}^{w^T x_i} f(\epsilon)\,d\epsilon = F(w^T x_i) = \mathrm{sigm}(w^T x_i) \tag{24.87}$$

as required.

We can derive an auxiliary variable Gibbs sampler by sampling from p(z|w, D) and p(w|z, D). 
Unfortunately, sampling directly from p(w|z,D) is not possible. One approach is to define 
ε_i ∼ N(0, λ_i), where λ_i = (2ψ_i)² and ψ_i ∼ KS, the Kolmogorov Smirnov distribution, and then
to sample w, z, λ and ψ (Holmes and Held 2006).

A simpler approach is to approximate the logistic distribution by the Student distribution
(Albert and Chib 1993). Specifically, we will make the approximation ε_i ∼ T(0, 1, ν), where
ν ≈ 8. We can now use the scale mixture of Gaussians representation of the Student to simplify
inference. In particular, we write

$$\begin{aligned}
\lambda_i &\sim \mathrm{Ga}(\nu/2, \nu/2) && (24.88) \\
\epsilon_i &\sim \mathcal{N}(0, \lambda_i^{-1}) && (24.89) \\
z_i &\triangleq w^T x_i + \epsilon_i && (24.90) \\
y_i = 1 | z_i &= \mathbb{I}(z_i \ge 0) && (24.91)
\end{aligned}$$


All of the full conditionals now have a simple form; see Exercise 24.7 for the details. 
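
The scale mixture representation in Equations 24.88 and 24.89 is easy to check by simulation. The sketch below assumes that Ga(ν/2, ν/2) is parameterized by shape and rate, and uses the standard fact that a Student distribution with ν > 2 degrees of freedom and unit scale has variance ν/(ν − 2); the function name is illustrative.

```python
import random

def sample_student_noise(nu, rng):
    """Draw eps by first drawing lambda ~ Ga(shape nu/2, rate nu/2) and then
    eps ~ N(0, 1/lambda). random.gammavariate takes (shape, scale), so the
    rate nu/2 is passed as the scale 2/nu."""
    lam = rng.gammavariate(nu / 2.0, 2.0 / nu)
    return rng.gauss(0.0, lam ** -0.5)

rng = random.Random(0)
nu = 8.0
eps = [sample_student_noise(nu, rng) for _ in range(100000)]
mean = sum(eps) / len(eps)
var = sum((e - mean) ** 2 for e in eps) / (len(eps) - 1)
print("sample variance:", var, "Student variance nu/(nu-2):", nu / (nu - 2.0))
```

Conditioned on λ_i the noise is Gaussian, which is what makes the full conditionals of the sampler tractable.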

Note that if we set ν = ∞, then z_i ∼ N(w^T x_i, 1), which is equivalent to probit regression (see
Section 9.4). Rather than choosing between probit or logit regression, we can simply estimate
the ν parameter. There is no convenient conjugate prior, but we can consider a finite range of
possible values and evaluate the posterior as follows:

$$p(\nu|\lambda) \propto p(\nu)\prod_{i=1}^{N} \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)}\,\lambda_i^{\nu/2 - 1}\,e^{-\nu\lambda_i/2} \tag{24.92}$$

Furthermore, if we define V_0 = v_0 I, we can sample v_0 as well. For example, suppose we use
an IG(δ_1, δ_2) prior for v_0. The posterior is given by p(v_0|w) = IG(δ_1 + D/2, δ_2 + ½ Σ_{j=1}^{D} w_j²).
This can be interleaved with the other Gibbs sampling steps, and provides an appealing Bayesian
alternative to cross validation for setting the strength of the regularizer.

See Table 24.1 for a summary of various algorithms for fitting probit and logit models. Many 
of these methods can also be extended to the multinomial logistic regression case. For details, 
see (Scott 2009; Fruhwirth-Schnatter and Fruhwirth 2010). 


Slice sampling 


Consider sampling from a univariate, but multimodal, distribution p̃(x). We can sometimes
improve the ability to make large moves by adding an auxiliary variable u. We define the joint
distribution as follows:

$$\hat{p}(x, u) = \begin{cases} 1/Z_p & \text{if } 0 \le u \le \tilde{p}(x) \\ 0 & \text{otherwise} \end{cases} \tag{24.93}$$

Figure 24.14 (a) Illustration of the principle behind slice sampling. Given a previous sample x^i, we
sample u^{i+1} uniformly on [0, f(x^i)], where f is the target density. We then sample x^{i+1} along the slice
where f(x) ≥ u^{i+1}. Source: Figure 15 of (Andrieu et al. 2003). Used with kind permission of Nando de
Freitas. (b) Slice sampling in action. Figure generated by sliceSamplingDemo1d.

Figure 24.15 Binomial regression for 1d data. (a) Grid approximation to posterior. (b) Slice sampling
approximation. Figure generated by sliceSamplingDemo2d.


where Z_p = ∫ p̃(x) dx. The marginal distribution over x is given by

$$\int \hat{p}(x, u)\,du = \int_0^{\tilde{p}(x)} \frac{1}{Z_p}\,du = \frac{\tilde{p}(x)}{Z_p} = p(x) \tag{24.94}$$

so we can sample from p(x) by sampling from p̂(x, u) and then ignoring u. The full conditionals
have the form

$$\begin{aligned}
p(u|x) &= U_{[0, \tilde{p}(x)]}(u) && (24.95) \\
p(x|u) &= U_A(x) && (24.96)
\end{aligned}$$

where A = {x : p̃(x) ≥ u} is the set of points on or above the chosen height u. This
corresponds to a slice through the distribution, hence the term slice sampling (Neal 2003a).
See Figure 24.14(a).


In practice, it can be difficult to identify the set A. So we can use the following approach:
construct an interval x_min ≤ x ≤ x_max around the current point x^s of some width. We then
test to see if each end point lies within the slice. If it does, we keep extending in that direction
until it lies outside the slice. This is called stepping out. A candidate value x' is then chosen
uniformly from this region. If it lies within the slice, it is kept, so x^{s+1} = x'. Otherwise we
shrink the region such that x' forms one end and such that the region still contains x^s. Then
another sample is drawn. We continue in this way until a sample is accepted.

To apply the method to multivariate distributions, we can sample one extra auxiliary variable 
for each dimension. The advantage of slice sampling over Gibbs is that it does not need 
a specification of the full-conditionals, just the unnormalized joint. The advantage of slice 
sampling over MH is that it does not need a user-specified proposal distribution (although it 
does require a specification of the width of the stepping out interval). 
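
The stepping out and shrinkage procedure can be written in a few lines for the univariate case. The following is a sketch under stated assumptions rather than the sliceSamplingDemo1d code: it works with the log of the unnormalized density, places the initial interval at a random offset around the current point (as in Neal's description), and caps the number of stepping out moves with a hypothetical max_steps parameter as a safety guard.

```python
import math
import random

def slice_sample(log_p, x0, n_samples, width, rng, max_steps=100):
    """1d slice sampler; log_p is the log of the unnormalized density."""
    x, out = x0, []
    for _ in range(n_samples):
        # u ~ U(0, p(x)), handled in log space: log u = log p(x) + log U
        log_u = log_p(x) + math.log(1.0 - rng.random())
        # initial interval of the given width, randomly placed around x
        lo = x - width * rng.random()
        hi = lo + width
        # stepping out: extend each end until it lies outside the slice
        steps = 0
        while log_p(lo) > log_u and steps < max_steps:
            lo -= width
            steps += 1
        steps = 0
        while log_p(hi) > log_u and steps < max_steps:
            hi += width
            steps += 1
        # propose uniformly; on rejection shrink the interval towards x
        while True:
            x_new = lo + (hi - lo) * rng.random()
            if log_p(x_new) >= log_u:
                x = x_new
                break
            if x_new < x:
                lo = x_new
            else:
                hi = x_new
        out.append(x)
    return out

rng = random.Random(0)
samples = slice_sample(lambda x: -0.5 * x * x, 3.0, 5000, 2.0, rng)
mean = sum(samples) / len(samples)
print("sample mean for a standard Gaussian target:", mean)
```

Only evaluations of the unnormalized density are needed, and the width parameter is the single tuning choice.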

Figure 24.14(b) illustrates the algorithm in action on a synthetic 1d problem. Figure 24.15 
illustrates its behavior on a slightly harder problem, namely binomial logistic regression. The 
model has the form 


We use a vague Gaussian prior for the β_j’s. Figure 24.15(a) shows a grid-based approximation
to the posterior, and Figure 24.15(b) shows a sample-based approximation. In this example, the 
grid is faster to compute, but for any problem with more than 2 dimensions, the grid approach 
is infeasible. 


Swendsen Wang 


Consider an Ising model of the following form:

$$p(x) = \frac{1}{Z}\prod_e f_e(x_e) \tag{24.98}$$

where x_e = (x_i, x_j) for edge e = (i, j), x_i ∈ {+1, −1}, and the edge factor f_e is defined by
$\begin{pmatrix} e^{J} & e^{-J} \\ e^{-J} & e^{J} \end{pmatrix}$, where J is the edge strength. Gibbs sampling in such models can be slow when
J is large in absolute value, because neighboring states can be highly correlated. The Swendsen
Wang algorithm (Swendsen and Wang 1987) is an auxiliary variable MCMC sampler which mixes
much faster, at least for the case of attractive or ferromagnetic models, with J > 0.

Suppose we introduce auxiliary binary variables, one per edge.⁵ These are called bond
variables, and will be denoted by z. We then define an extended model p(x, z) of the form

$$p(x, z) = \frac{1}{Z'}\prod_e g_e(x_e, z_e) \tag{24.99}$$

where z_e ∈ {0, 1}, and we define the new factor as follows:
$g_e(x_e, z_e = 0) = \begin{pmatrix} e^{-J} & e^{-J} \\ e^{-J} & e^{-J} \end{pmatrix}$
and $g_e(x_e, z_e = 1) = \begin{pmatrix} e^{J} - e^{-J} & 0 \\ 0 & e^{J} - e^{-J} \end{pmatrix}$. It is clear that $\sum_{z_e=0}^{1} g_e(x_e, z_e) = f_e(x_e)$,


5. Our presentation of the method is based on some notes by David MacKay, available from
http://www.inference.phy.cam.ac.uk/mackay/itila/swendsen.pdf.

Figure 24.16 Illustration of the Swendsen Wang algorithm on a 2d grid. Used with kind permission of
Kevin Tang.


and hence that Σ_z p(x, z) = p(x). So if we can sample from this extended model, we can just
throw away the z samples and get valid x samples from the original distribution. 

Fortunately, it is easy to apply Gibbs sampling to this extended model. The full conditional 
p(z|x) factorizes over the edges, since the bond variables are conditionally independent given 
the node variables. Furthermore, the full conditional p(z_e|x_e) is simple to compute: if the
nodes on either end of the edge are in the same state (x_i = x_j), we set the bond z_e to 1 with
probability p = 1 − e^{−2J}, otherwise we set it to 0. In Figure 24.16 (top right), the bonds that
could be turned on (because their corresponding nodes are in the same state) are represented 
by dotted edges. In Figure 24.16 (bottom right), the bonds that are randomly turned on are 
represented by solid edges. 

To sample p(x|z), we proceed as follows. Find the connected components defined by the
graph induced by the bonds that are turned on. (Note that a connected component may consist
of a singleton node.) Pick one of these components uniformly at random. All the nodes in each
such component must have the same state, since the off-diagonal terms in the g_e(x_e, z_e = 1)
factor are 0. Pick a state ±1 uniformly at random, and force all the variables in this component
to adopt this new state. This is illustrated in Figure 24.16 (bottom left), where the green square
denotes the selected connected component, and we choose to force all nodes within it to enter
the white state.

The validity of this algorithm is left as an exercise, as is the extension to handle local evidence 
and non-stationary potentials. 
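
The bond and cluster updates can be sketched as follows. Note one deliberate difference from the description above, stated here as an assumption: this sketch follows the commonly used form of the algorithm in which every connected component, rather than a single randomly chosen one, is assigned a fresh state ±1. The function names and the union-find bookkeeping are illustrative choices, and there is no local evidence.

```python
import math
import random

def find(parent, i):
    """Root of the cluster containing node i (union-find with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def sw_sweep(x, edges, J, rng):
    """One Swendsen-Wang update for an Ising model with edge strength J > 0.
    x is a list of +1/-1 states and edges is a list of (i, j) pairs."""
    n = len(x)
    p_bond = 1.0 - math.exp(-2.0 * J)
    parent = list(range(n))
    # sample z | x: a bond can only be turned on if its end points agree
    for i, j in edges:
        if x[i] == x[j] and rng.random() < p_bond:
            parent[find(parent, i)] = find(parent, j)
    # sample x | z: each connected component gets an independent random state
    new_state, out = {}, []
    for i in range(n):
        root = find(parent, i)
        if root not in new_state:
            new_state[root] = rng.choice((-1, 1))
        out.append(new_state[root])
    return out

# 3 x 3 grid with 4 nearest neighbor connectivity
side = 3
edges = [(r * side + c, r * side + c + 1) for r in range(side) for c in range(side - 1)]
edges += [(r * side + c, (r + 1) * side + c) for r in range(side - 1) for c in range(side)]
rng = random.Random(0)
x = [rng.choice((-1, 1)) for _ in range(side * side)]
for _ in range(100):
    x = sw_sweep(x, edges, 0.5, rng)
print(x)
```

For a model with a single edge, the probability that the two nodes agree is e^J / (e^J + e^{−J}), which gives a simple check on the sampler.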

It should be intuitively clear that Swendsen Wang makes much larger moves through the state 
space than Gibbs sampling. In fact, SW mixes much faster than Gibbs sampling on 2d lattice 
Ising models for a variety of values of the coupling parameter, provided J > 0. More precisely, 
let the edge strength be parameterized by J/T, where T > 0 is a computational temperature. 
For large T, the nodes are roughly independent, so both methods work equally well. However, 
as T approaches a critical temperature T_c, the typical states of the system have very long
correlation lengths, and Gibbs sampling takes a very long time to generate independent samples.
As the temperature continues to drop, the typical states are either all on or all off. The frequency
with which Gibbs sampling moves between these two modes is exponentially small. By contrast,
SW mixes rapidly at all temperatures. 

Unfortunately, if any of the edge weights are negative, J < 0, the system is frustrated, and 
there are exponentially many modes, even at low temperature. SW does not work very well in 
this setting, since it tries to force many neighboring variables to have the same state. In fact, 
computation in this regime is provably hard for any algorithm (Jerrum and Sinclair 1993, 1996).


Hybrid/Hamiltonian MCMC * 


In this section, we briefly mention a way to perform MCMC sampling for continuous state 
spaces, for which we can compute the gradient of the (unnormalized) log-posterior. This is the 
case in neural network models, for example. 

The basic idea is to think of the parameters as a particle in space, and to create auxiliary 
variables which represent the “momentum” of this particle. We then update this parameter/ 
momentum pair according to certain rules (see e.g., (Duane et al. 1987; Neal 1993; MacKay 2003; 
Neal 2010) for details). The resulting method is called hybrid MCMC or Hamiltonian MCMC. 
The two main parameters that the user must specify are how many leapfrog steps to take 
when updating the position/momentum, and how big to make these steps. Performance can
be quite sensitive to these parameters (although see (Hoffman and Gelman 2011) for a recent 
way to set them automatically). This method can be combined with stochastic gradient descent 
(Section 8.5.2) in order to handle large datasets, as explained in (Ahn et al. 2012). 

Recently, a more powerful extension of this method has been developed, that exploits second- 
order gradient information. See (Girolami et al. 2010) for details. 
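
Since the update rules are only referenced above, the following sketch shows one standard textbook form of the method for a scalar parameter, to make the role of the two tuning parameters concrete. It is an illustration under assumptions, not the scheme of any particular cited paper: the momentum has unit variance, the dynamics are simulated with the leapfrog integrator, and a Metropolis test on the total energy corrects for discretization error. The function and argument names are hypothetical.

```python
import math
import random

def hmc(log_p, grad_log_p, x0, n_samples, step_size, n_leapfrog, rng):
    """Hamiltonian MCMC for a 1d target with unnormalized log density log_p."""
    x, out = x0, []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)                    # fresh momentum
        x_new, p_new = x, p
        # leapfrog: half step in momentum, full steps in position, half step
        p_new += 0.5 * step_size * grad_log_p(x_new)
        for i in range(n_leapfrog):
            x_new += step_size * p_new
            if i < n_leapfrog - 1:
                p_new += step_size * grad_log_p(x_new)
        p_new += 0.5 * step_size * grad_log_p(x_new)
        # accept or reject based on the change in total energy
        h_old = -log_p(x) + 0.5 * p * p
        h_new = -log_p(x_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
        out.append(x)
    return out

rng = random.Random(0)
samples = hmc(lambda x: -0.5 * x * x, lambda x: -x, 3.0, 2000, 0.2, 10, rng)
print("sample mean for a standard Gaussian target:", sum(samples) / len(samples))
```

Here step_size and n_leapfrog are the two user-specified quantities mentioned above; too large a step lowers the acceptance rate, while too few steps makes the moves short.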


Annealing methods 


Many distributions are multimodal and hence hard to sample from. However, by analogy to the 
way metals are heated up and then cooled down in order to make the molecules align, we can 
imagine using a computational temperature parameter to smooth out a distribution, gradually 
cooling it to recover the original “bumpy” distribution. We first explain this idea in more detail 
in the context of an algorithm for MAP estimation. We then discuss extensions to the sampling 
case.

Figure 24.17 An energy surface at different temperatures. Note the different vertical scales. (a) T = 1.
(b) T = 0.5. Figure generated by saDemoPeaks.


Simulated annealing 


Simulated annealing (Kirkpatrick et al. 1983) is a stochastic algorithm that attempts to find 
the global optimum of a black-box function f(x). It is closely related to the Metropolis- 
Hastings algorithm for generating samples from a probability distribution, which we discussed 
in Section 24.3. SA can be used for both discrete and continuous optimization. 

The method is inspired by statistical physics. The key quantity is the Boltzmann distribution, 
which specifies that the probability of being in any particular state x is given by 


$$p(x) \propto \exp(-f(x)/T) \tag{24.100}$$


where f(x) is the “energy” of the system and T is the computational temperature. As the 
temperature approaches 0 (so the system is cooled), the system spends more and more time in 
its minimum energy (most probable) state. 

Figure 24.17 gives an example of a 2d function at different temperatures. At high temperatures,
T ≫ 1, the surface is approximately flat, and hence it is easy to move around (i.e., to avoid
local optima). As the temperature cools, the largest peaks become larger, and the smallest peaks 
disappear. By cooling slowly enough, it is possible to “track” the largest peak, and thus find the 
global optimum. This is an example of a continuation method. 

We can generate an algorithm from this as follows. At each step, sample a new state according 
to some proposal distribution x' ∼ q(·|x_k). For real-valued parameters, this is often simply a
random walk proposal, x' = x_k + ε_k, where ε_k ∼ N(0, Σ). For discrete optimization, other
kinds of local moves must be defined. 

Having proposed a new state, we compute 


$$\alpha = \exp\left((f(x) - f(x'))/T\right) \tag{24.101}$$


We then accept the new state (i.e., set x_{k+1} = x') with probability min(1, α), otherwise we stay
in the current state (i.e., set x_{k+1} = x_k). This means that if the new state has lower energy (is
more probable), we will definitely accept it, but if it has higher energy (is less probable), we might
still accept, depending on the current temperature. Thus the algorithm allows “down-hill” moves
in probability space (up-hill in energy space), but less frequently as the temperature drops.

Figure 24.18 A run of simulated annealing on the energy surface in Figure 24.17. (a) Temperature vs
iteration. (b) Energy vs iteration. Figure generated by saDemoPeaks.

[Figure 24.19 panel titles: iter 550, temp 0.064; iter 1000, temp 0.007.]


Figure 24.19 Histogram of samples from the annealed “posterior” at 2 different time points produced by 
simulated annealing on the energy surface shown in Figure 24.17. Note that at cold temperatures, most of 
the samples are concentrated near the peak at (38,25). Figure generated by saDemoPeaks. 


The rate at which the temperature changes over time is called the cooling schedule. It 
has been shown (Kirkpatrick et al. 1983) that if one cools sufficiently slowly, the algorithm will 
provably find the global optimum. However, it is not clear what “sufficiently slowly” means.
In practice it is common to use an exponential cooling schedule of the following form:
T_k = T_0 C^k, where T_0 is the initial temperature (often T_0 ∼ 1) and C is the cooling rate (often
C ∼ 0.8). See Figure 24.18(a) for a plot of this cooling schedule. Cooling too quickly means one
can get stuck in a local maximum, but cooling too slowly just wastes time. The best cooling 
schedule is difficult to determine; this is one of the main drawbacks of simulated annealing. 
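
The proposal, acceptance rule and exponential cooling schedule fit together as in the sketch below. This is not the saDemoPeaks code: the 1d double-well energy function, the starting point, the proposal width and the cooling rate of 0.99 per iteration are all illustrative assumptions, and the routine returns the lowest-energy state visited.

```python
import math
import random

def simulated_annealing(f, x0, n_iter, rng, t0=1.0, cooling=0.99, step=0.5):
    """Minimize the energy f with a random walk proposal and the
    cooling schedule T_k = t0 * cooling**k."""
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    temp = t0
    for _ in range(n_iter):
        x_new = x + rng.gauss(0.0, step)           # random walk proposal
        f_new = f(x_new)
        # alpha = exp((f(x) - f(x')) / T); accept with probability min(1, alpha)
        if f_new <= fx or rng.random() < math.exp((fx - f_new) / temp):
            x, fx = x_new, f_new
            if fx < best_f:
                best_x, best_f = x, fx
        temp *= cooling                            # cool down
    return best_x, best_f

def double_well(x):
    """Hypothetical energy: a local minimum near x = 1 and a lower,
    global minimum near x = -1, separated by a barrier at x = 0."""
    return (x * x - 1.0) ** 2 + 0.3 * x

rng = random.Random(0)
best_x, best_f = simulated_annealing(double_well, 1.0, 2000, rng)
print("best state:", best_x, "energy:", best_f)
```

Starting in the shallower well, the chain can cross the barrier while the temperature is still high, and is then increasingly confined as the temperature drops.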

Figure 24.18(b) shows an example of simulated annealing applied to the function in Figure 24.17 
using a random walk proposal. We see that the method stochastically reduces the energy 
over time. Figure 24.19 illustrates (a histogram of) samples drawn from the cooled probability
distribution over time. We see that most of the samples are concentrated near the global
maximum. When the algorithm has converged, we just return the largest value found.


Annealed importance sampling


We now describe a method known as annealed importance sampling (Neal 2001) that com- 
bines ideas from simulated annealing and importance sampling in order to draw independent 
samples from difficult (e.g., multimodal) distributions. 

Suppose we want to sample from p_0(x) ∝ f_0(x), but we cannot do so easily; for example,
this might represent a multimodal posterior. Suppose however that there is an easier distribution
which we can sample from, call it p_n(x) ∝ f_n(x); for example, this might be the prior. We
can now construct a sequence of intermediate distributions that move slowly from p_n to p_0 as
follows:


$$f_j(x) = f_0(x)^{\beta_j}\,f_n(x)^{1-\beta_j} \tag{24.102}$$


where 1 = bo > 6; > --- > Bn = 0, where p; is an inverse temperature. (Contrast this to the 
scheme used by simulated annealing which has the form f;(x) = fo(x)°/; this makes it hard 
to sample from pn.) Furthermore, suppose we have a series of Markov chains T;(x,x’) (from x 
to x’) which leave each p; invariant. Given this, we can sample x from po by first sampling a 
sequence Z = (Zn—1,. . - , Zo) as follows: sample z,-1 ~ pn; sample Zn-2 ~ Th—1(Zn—1,°)} =; 
sample zo ~ T\(Z1,-). Finally we set x = Zo and give it weight 


$w = \frac{f_{n-1}(z_{n-1})}{f_n(z_{n-1})} \frac{f_{n-2}(z_{n-2})}{f_{n-1}(z_{n-2})} \cdots \frac{f_1(z_1)}{f_2(z_1)} \frac{f_0(z_0)}{f_1(z_0)}$ (24.103) 


This can be shown to be correct by viewing the algorithm as a form of importance sampling 
in an extended state space $z = (z_0, \ldots, z_{n-1})$. Consider the following distribution on this state 
space: 


$p(z) \propto f(z) = f_0(z_0) \tilde{T}_1(z_0, z_1) \tilde{T}_2(z_1, z_2) \cdots \tilde{T}_{n-1}(z_{n-2}, z_{n-1})$ (24.104) 


where $\tilde{T}_j$ is the reversal of $T_j$: 


$\tilde{T}_j(z, z') = T_j(z', z) p_j(z') / p_j(z) = T_j(z', z) f_j(z') / f_j(z)$ (24.105) 


It is clear that $\sum_{z_1, \ldots, z_{n-1}} f(z) = f_0(z_0)$, so we can safely just use the $z_0$ part of these 
sequences to recover the original distribution. 

Now consider the proposal distribution defined by the algorithm: 


$q(z) \propto g(z) = f_n(z_{n-1}) T_{n-1}(z_{n-1}, z_{n-2}) \cdots T_2(z_2, z_1) T_1(z_1, z_0)$ (24.106) 


One can show that the importance weights $w = \frac{f(z_0, \ldots, z_{n-1})}{g(z_0, \ldots, z_{n-1})}$ are given by Equation 24.103. 
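
The sampling scheme and the weight in Equation 24.103 can be sketched as follows. This is a minimal illustration under stated assumptions: the 1d target and easy distribution, the linear spacing of the $\beta_j$, and the use of a single random walk Metropolis step as each $T_j$ are choices made for the example, not part of the method's definition.

```python
import numpy as np

def ais(log_f0, log_fn, sample_pn, betas, n_chains, step, rng):
    """Annealed importance sampling with one Metropolis step per temperature.

    betas[0] = 1 (target f0), ..., betas[n] = 0 (easy distribution fn).
    Returns the final states z0 and their log importance weights.
    """
    n = len(betas) - 1

    def log_fj(j, x):
        # log f_j(x) = beta_j log f0(x) + (1 - beta_j) log fn(x)
        return betas[j] * log_f0(x) + (1.0 - betas[j]) * log_fn(x)

    z = sample_pn(n_chains, rng)              # z_{n-1} ~ p_n
    log_w = np.zeros(n_chains)
    for j in range(n, 0, -1):
        # factor f_{j-1}(z_{j-1}) / f_j(z_{j-1}) of the weight
        log_w += log_fj(j - 1, z) - log_fj(j, z)
        if j > 1:
            # z_{j-2} ~ T_{j-1}(z_{j-1}, .): Metropolis step leaving p_{j-1} invariant
            prop = z + step * rng.standard_normal(n_chains)
            log_accept = log_fj(j - 1, prop) - log_fj(j - 1, z)
            accept = np.log(rng.random(n_chains)) < log_accept
            z = np.where(accept, prop, z)
    return z, log_w

def log_f0(x):
    # Hypothetical unnormalized target: Gaussian bump centered at 3 with std 0.5.
    return -0.5 * ((x - 3.0) / 0.5) ** 2

def log_fn(x):
    # Hypothetical easy distribution: normalized N(0, 3^2), so Z_n = 1.
    return -0.5 * (x / 3.0) ** 2 - 0.5 * np.log(2 * np.pi * 9.0)

def sample_pn(m, rng):
    return 3.0 * rng.standard_normal(m)

rng = np.random.default_rng(0)
betas = np.linspace(1.0, 0.0, 201)
z0, log_w = ais(log_f0, log_fn, sample_pn, betas, n_chains=2000, step=1.0, rng=rng)
w = np.exp(log_w)
Z_ratio = w.mean()                         # Monte Carlo estimate of Z_0 / Z_n
post_mean = np.sum(w * z0) / np.sum(w)     # weighted estimate of E[x] under p_0
print(Z_ratio, 0.5 * np.sqrt(2 * np.pi), post_mean)
```

Because each chain is run independently, the weighted samples are independent, and the average weight estimates the ratio of normalization constants; here the exact ratio is $0.5\sqrt{2\pi}$ since the easy distribution is normalized.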


Parallel tempering 


Another way to combine MCMC and annealing is to run multiple chains in parallel at different 
temperatures, and allow one chain to sample from another chain at a neighboring temperature. 
In this way, the high temperature chain can make long distance moves through the state space, 
and have this influence lower temperature chains. This is known as parallel tempering. See, 
e.g., (Earl and Deem 2005) for details. 


Approximating the marginal likelihood 


The marginal likelihood $p(\mathcal{D}|M)$ is a key quantity for Bayesian model selection, and is given by 

$p(\mathcal{D}|M) = \int p(\mathcal{D}|\theta, M) p(\theta|M) d\theta$ (24.107) 


Unfortunately, this integral is often intractable to compute, for example if we have non-conjugate 
priors, and/or we have hidden variables. In this section, we briefly discuss some ways to 
approximate this expression using Monte Carlo. See (Gelman and Meng 1998) for a more 
extensive review. 


The candidate method 


There is a simple method for approximating the marginal likelihood known as the Candidate 
method (Chib 1995). This exploits the following identity: 


$p(\mathcal{D}|M) = \frac{p(\mathcal{D}|\theta, M) p(\theta|M)}{p(\theta|\mathcal{D}, M)}$ (24.108) 


This holds for any value of $\theta$. Once we have picked some value, we can evaluate $p(\mathcal{D}|\theta, M)$ 
and $p(\theta|M)$ quite easily. If we have some estimate of the posterior near $\theta$, we can then evaluate 
the denominator as well. This posterior is often approximated using MCMC. 

The flaw with this method is that it relies on the assumption that $p(\theta|\mathcal{D}, M)$ has marginalized 
over all the modes of the posterior, which in practice is rarely possible. Consequently the method 
can give very inaccurate results in practice (Neal 1998). 
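
The identity in Equation 24.108 can be checked numerically in a conjugate model where the posterior is available in closed form. The sketch below assumes a hypothetical Beta-Bernoulli model (prior Beta(a, b), data with N1 successes and N0 failures); because the posterior is exact here, the identity holds for every $\theta$, which is precisely what fails when the posterior is only approximated by MCMC.

```python
import math

def log_beta_fn(a, b):
    # log of the Beta function B(a, b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_beta_pdf(theta, a, b):
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta) - log_beta_fn(a, b)

a, b = 2.0, 2.0      # hypothetical Beta prior
N1, N0 = 7, 3        # hypothetical data: 7 successes, 3 failures

def log_marglik_candidate(theta):
    log_lik = N1 * math.log(theta) + N0 * math.log(1 - theta)   # log p(D | theta)
    log_prior = log_beta_pdf(theta, a, b)                       # log p(theta)
    log_post = log_beta_pdf(theta, a + N1, b + N0)              # log p(theta | D), exact by conjugacy
    return log_lik + log_prior - log_post

# Closed-form marginal likelihood of the observed sequence.
log_exact = log_beta_fn(a + N1, b + N0) - log_beta_fn(a, b)
for theta in (0.2, 0.5, 0.9):
    print(theta, log_marglik_candidate(theta), log_exact)
```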


Harmonic mean estimate 


Newton and Raftery (1994) proposed a simple method for approximating $p(\mathcal{D})$ using the output 
of MCMC, as follows: 


$1/p(\mathcal{D}) \approx \frac{1}{S} \sum_{s=1}^{S} \frac{1}{p(\mathcal{D}|\theta^s)}$ (24.109) 


where $\theta^s \sim p(\theta|\mathcal{D})$. This expression is the harmonic mean of the likelihood of the data under 
each sample. The theoretical correctness of this expression follows from the following identity: 


$\int \frac{1}{p(\mathcal{D}|\theta)} p(\theta|\mathcal{D}) d\theta = \int \frac{1}{p(\mathcal{D}|\theta)} \frac{p(\mathcal{D}|\theta) p(\theta)}{p(\mathcal{D})} d\theta = \frac{1}{p(\mathcal{D})} \int p(\theta) d\theta = \frac{1}{p(\mathcal{D})}$ (24.110) 


Unfortunately, in practice this method works very poorly. Indeed, Radford Neal called this “the 
worst Monte Carlo method ever”.⁶ The reason it is so bad is that it depends only on samples 
drawn from the posterior. But the posterior is often very insensitive to the prior, whereas the 
marginal likelihood is not. We only mention this method in order to warn against its use. We 
present a better method below. 
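
To make the warning concrete, the following sketch computes the harmonic mean estimate next to the exact answer in a hypothetical Beta-Bernoulli model, where the posterior can be sampled directly and the marginal likelihood is known in closed form. The model, the two priors, and the sample size are assumptions made for the example; the estimator is written in log space for numerical stability.

```python
import math
import numpy as np

def log_beta_fn(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def exact_log_marglik(N1, N0, a, b):
    # Closed-form log p(D) for a Bernoulli sequence with a Beta(a, b) prior.
    return log_beta_fn(a + N1, b + N0) - log_beta_fn(a, b)

def harmonic_mean_log_marglik(N1, N0, a, b, S, rng):
    theta = rng.beta(a + N1, b + N0, size=S)            # theta^s ~ p(theta | D)
    log_lik = N1 * np.log(theta) + N0 * np.log1p(-theta)
    # log of (1/S) sum_s 1 / p(D | theta^s), computed stably in log space
    log_inv = np.logaddexp.reduce(-log_lik) - math.log(S)
    return -log_inv

rng = np.random.default_rng(0)
N1, N0 = 7, 3
for a, b in [(1.0, 1.0), (50.0, 50.0)]:   # a vague and a strong prior (hypothetical)
    est = harmonic_mean_log_marglik(N1, N0, a, b, S=10000, rng=rng)
    print((a, b), "harmonic mean:", est, "exact:", exact_log_marglik(N1, N0, a, b))
```

Comparing the two columns for different priors, and across random seeds, gives a feel for how unreliable the estimate can be even in a model this simple.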


6. Source: radfordneal.wordpress.com/2008/08/17/the-harmonic-mean-of-the-likelihood-worst-monte-carlo-method-ever. 


Annealed importance sampling 


We can use annealed importance sampling (Section 24.6.2) to evaluate a ratio of partition 
functions. Notice that $Z_0 = \int f_0(x) dx = \int f(z) dz$, and $Z_n = \int f_n(x) dx = \int g(z) dz$. Hence 


$\frac{Z_0}{Z_n} = \frac{\int f(z) dz}{\int g(z) dz} = \frac{\int \frac{f(z)}{g(z)} g(z) dz}{\int g(z) dz} = E_q\left[\frac{f(z)}{g(z)}\right] \approx \frac{1}{S} \sum_{s=1}^{S} w_s$ (24.111) 


If $f_n$ is a prior and $f_0$ is the (unnormalized) posterior, we can estimate $Z_0 = p(\mathcal{D})$ using the above equation, 
provided the prior has a known normalization constant $Z_n$. This is generally considered the 
method of choice for evaluating difficult partition functions. 


Exercises 


Exercise 24.1 Gibbs sampling from a 2D Gaussian 


Suppose $x \sim \mathcal{N}(\mu, \Sigma)$, where $\mu = (1, 1)$ and $\Sigma = (1, -0.5; -0.5, 1)$. Derive the full conditionals 
$p(x_1|x_2)$ and $p(x_2|x_1)$. Implement the algorithm and plot the 1d marginals $p(x_1)$ and $p(x_2)$ as 
histograms. Superimpose a plot of the exact marginals. 


Exercise 24.2 Gibbs sampling for a 1D Gaussian mixture model 


Consider applying Gibbs sampling to a univariate mixture of Gaussians, as in Section 24.2.3. Derive the 
expressions for the full conditionals. Hint: if we know $z_n = j$ (say), then $\mu_j$ gets “connected” to $x_n$, but 
all other values of $\mu_i$, for all $i \neq j$, are irrelevant. (This is an example of context-specific independence, 
where the structure of the graph simplifies once we have assigned values to some of the nodes.) Hence, 
given all the $z_n$ values, the posteriors of the $\mu$'s should be independent, so the conditional of $\mu_j$ should 
be independent of $\mu_{-j}$. (Similarly for $\sigma_j$.) 


Exercise 24.3 Gibbs sampling from the Potts model 


Modify the code in gibbsDemoIsing to draw samples from a Potts prior at different temperatures, as in 
Figure 19.8. 


Exercise 24.4 Full conditionals for hierarchical model of Gaussian means 
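Let us reconsider the Gaussian-Gaussian model in Section 5.6.2 for modelling multiple related mean 
parameters $\theta_j$. In this exercise we derive a Gibbs sampler instead of using EB. Suppose, following (Hoff 
2009, p134), that we use the following conjugate priors on the hyper-parameters: 

$\mu \sim \mathcal{N}(\mu_0, \gamma_0^2)$ (24.112) 
$\tau^2 \sim \mathrm{IG}(\eta_0/2, \eta_0 \tau_0^2/2)$ (24.113) 
$\sigma^2 \sim \mathrm{IG}(\nu_0/2, \nu_0 \sigma_0^2/2)$ (24.114) 

We can set $\eta = (\mu_0, \gamma_0, \eta_0, \tau_0, \nu_0, \sigma_0)$ to uninformative values. Given this model specification, show that 
the full conditionals for $\mu$, $\tau$, $\sigma$ and the $\theta_j$ are as follows: 


$p(\mu|\theta_{1:D}, \tau^2) = \mathcal{N}\left(\mu \,\Big|\, \frac{D \bar{\theta}/\tau^2 + \mu_0/\gamma_0^2}{D/\tau^2 + 1/\gamma_0^2}, [D/\tau^2 + 1/\gamma_0^2]^{-1}\right)$ (24.115) 

$p(\theta_j|\mu, \tau^2, \mathcal{D}_j, \sigma^2) = \mathcal{N}\left(\theta_j \,\Big|\, \frac{N_j \bar{x}_j/\sigma^2 + \mu/\tau^2}{N_j/\sigma^2 + 1/\tau^2}, [N_j/\sigma^2 + 1/\tau^2]^{-1}\right)$ (24.116) 

$p(\tau^2|\theta_{1:D}, \mu) = \mathrm{IG}\left(\tau^2 \,\Big|\, \frac{\eta_0 + D}{2}, \frac{\eta_0 \tau_0^2 + \sum_j (\theta_j - \mu)^2}{2}\right)$ (24.117) 

$p(\sigma^2|\theta_{1:D}, \mathcal{D}) = \mathrm{IG}\left(\sigma^2 \,\Big|\, \frac{1}{2}\Big[\nu_0 + \sum_{j=1}^{D} N_j\Big], \frac{1}{2}\Big[\nu_0 \sigma_0^2 + \sum_{j=1}^{D} \sum_{i=1}^{N_j} (x_{ij} - \theta_j)^2\Big]\right)$ (24.118) 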


Exercise 24.5 Gibbs sampling for robust linear regression with a Student t likelihood 


Modify the EM algorithm in Exercise 11.12 to perform Gibbs sampling for $p(w, \sigma^2, z|\mathcal{D}, \nu)$. 


Exercise 24.6 Gibbs sampling for probit regression 

Modify the EM algorithm in Section 11.4.6 to perform Gibbs sampling for $p(w, z|\mathcal{D})$. Hint: we can 
sample from a truncated Gaussian, $\mathcal{N}(z|\mu, \sigma) I(a \leq z \leq b)$, in two steps: first sample 
$u \sim U(\Phi((a - \mu)/\sigma), \Phi((b - \mu)/\sigma))$, then set $z = \mu + \sigma \Phi^{-1}(u)$ (Robert 1995). 
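
The two-step recipe in the hint can be sketched with the Python standard library, whose `statistics.NormalDist` provides the Gaussian cdf $\Phi$ and its inverse. The function name and the test interval are assumptions for the example; note also that this inverse-cdf approach loses accuracy when the interval lies far in the tails.

```python
import random
from statistics import NormalDist

def sample_truncated_gaussian(mu, sigma, a, b, rng):
    std = NormalDist()                     # standard normal: cdf is Phi, inv_cdf is Phi^{-1}
    lo = std.cdf((a - mu) / sigma)
    hi = std.cdf((b - mu) / sigma)
    u = rng.uniform(lo, hi)                # u ~ U(Phi((a - mu)/sigma), Phi((b - mu)/sigma))
    return mu + sigma * std.inv_cdf(u)     # z = mu + sigma * Phi^{-1}(u)

rng = random.Random(0)
samples = [sample_truncated_gaussian(0.0, 1.0, 0.5, 2.0, rng) for _ in range(1000)]
print(min(samples), max(samples))
```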

Exercise 24.7 Gibbs sampling for logistic regression with the Student approximation 

Derive the full conditionals for the joint model defined by Equations 24.88 to 24.91. 


Clustering 


Introduction 


Clustering is the process of grouping similar objects together. There are two kinds of inputs we 
might use. In similarity-based clustering, the input to the algorithm is an N x N dissimilarity 
matrix or distance matrix D. In feature-based clustering, the input to the algorithm is an 
N x D feature matrix or design matrix X. Similarity-based clustering has the advantage that it 
allows for easy inclusion of domain-specific similarity or kernel functions (Section 14.2). Feature- 
based clustering has the advantage that it is applicable to “raw”, potentially noisy data. We will 
see examples of both below. 

In addition to the two types of input, there are two possible types of output: flat cluster- 
ing, also called partitional clustering, where we partition the objects into disjoint sets; and 
hierarchical clustering, where we create a nested tree of partitions. We will discuss both of 
these below. Not surprisingly, flat clusterings are usually faster to create ($O(ND)$ for flat vs 
$O(N^2 \log N)$ for hierarchical), but hierarchical clusterings are often more useful. Furthermore, 
most hierarchical clustering algorithms are deterministic and do not require the specification of 
K, the number of clusters, whereas most flat clustering algorithms are sensitive to the initial 
conditions and require some model selection method for K. (We will discuss how to choose K 
in more detail below.) 

The final distinction we will make in this chapter is whether the method is based on a 
probabilistic model or not. One might wonder why we even bother discussing non-probabilistic 
methods for clustering. The reason is two-fold: first, they are widely used, so readers should 
know about them; second, they often contain good ideas, which can be used to speed up 
inference in probabilistic models. 


Measuring (dis)similarity 


A dissimilarity matrix $D$ is a matrix where $d_{i,i} = 0$ and $d_{i,j} \geq 0$ is a measure of “distance” 
between objects $i$ and $j$. Subjectively judged dissimilarities are seldom distances in the strict 
sense, since the triangle inequality, $d_{i,j} \leq d_{i,k} + d_{j,k}$, often does not hold. Some algorithms 
require $D$ to be a true distance matrix, but many do not. If we have a similarity matrix $S$, we 
can convert it to a dissimilarity matrix by applying any monotonically decreasing function, e.g., 
$D = \max(S) - S$. 


The most common way to define dissimilarity between objects is in terms of the dissimilarity 
of their attributes: 

$\Delta(x_i, x_{i'}) = \sum_{j=1}^{D} \Delta_j(x_{ij}, x_{i'j})$ (25.1) 


Some common attribute dissimilarity functions are as follows: 


• Squared (Euclidean) distance: 

$\Delta_j(x_{ij}, x_{i'j}) = (x_{ij} - x_{i'j})^2$ (25.2) 

Of course, this only makes sense if attribute $j$ is real-valued. 


• Squared distance strongly emphasizes large differences (because differences are squared). A 
more robust alternative is to use an $\ell_1$ distance: 

$\Delta_j(x_{ij}, x_{i'j}) = |x_{ij} - x_{i'j}|$ (25.3) 

This is also called city block distance, since, in 2D, the distance can be computed by 
counting how many rows and columns we have to move horizontally and vertically to get 
from $x_i$ to $x_{i'}$. 


• If $x_i$ is a vector (e.g., a time-series of real-valued data), it is common to use the correlation 
coefficient (see Section 2.5.1). If the data is standardized, then $\mathrm{corr}[x_i, x_{i'}] = \sum_j x_{ij} x_{i'j}$, 
and hence $\sum_j (x_{ij} - x_{i'j})^2 = 2(1 - \mathrm{corr}[x_i, x_{i'}])$. So clustering based on correlation 
(similarity) is equivalent to clustering based on squared distance (dissimilarity). 


• For ordinal variables, such as {low, medium, high}, it is standard to encode the values as 
real-valued numbers, say 1/3, 2/3, 3/3 if there are 3 possible values. One can then apply 
any dissimilarity function for quantitative variables, such as squared distance. 


• For categorical variables, such as {red, green, blue}, we usually assign a distance of 1 if the 
features are different, and a distance of 0 otherwise. Summing up over all the categorical 
features gives 

$\Delta(x_i, x_{i'}) = \sum_{j=1}^{D} I(x_{ij} \neq x_{i'j})$ (25.4) 

This is called the Hamming distance. 
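
The attribute dissimilarities listed above can be written in a few lines. The sketch below uses hypothetical example vectors, and it takes "standardized" to mean zero mean and unit Euclidean norm, which is the assumption under which the inner product equals the correlation coefficient and the squared-distance identity holds exactly.

```python
import numpy as np

def squared_distance(x, y):
    return float(np.sum((x - y) ** 2))

def city_block_distance(x, y):
    return float(np.sum(np.abs(x - y)))

def hamming_distance(x, y):
    # number of categorical attributes on which the two objects differ
    return sum(1 for xj, yj in zip(x, y) if xj != yj)

def standardize(x):
    # zero mean and unit Euclidean norm (assumption stated in the lead-in)
    x = x - x.mean()
    return x / np.linalg.norm(x)

x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([2.0, 1.0, 5.0, 6.0])
xs, ys = standardize(x), standardize(y)
corr = float(np.corrcoef(x, y)[0, 1])

print(squared_distance(x, y), city_block_distance(x, y))
print(squared_distance(xs, ys), 2 * (1 - corr))      # the two values agree
print(hamming_distance(["red", "green", "blue"], ["red", "blue", "blue"]))
```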


Evaluating the output of clustering methods * 


The validation of clustering structures is the most difficult and frustrating part of cluster 
analysis. Without a strong effort in this direction, cluster analysis will remain a black art 
accessible only to those true believers who have experience and great courage. — Jain 
and Dubes (Jain and Dubes 1988) 


[Figure: cluster 1 = {A, A, A, A, A, B}; cluster 2 = {A, B, B, B, B, C}; cluster 3 = {A, A, C, C, C}] 


Figure 25.1 Three clusters with labeled objects inside. Based on Figure 16.4 of (Manning et al. 2008). 


Clustering is an unsupervised learning technique, so it is hard to evaluate the quality of the output 
of any given method. If we use probabilistic models, we can always evaluate the likelihood of 
a test set, but this has two drawbacks: first, it does not directly assess any clustering that is 
discovered by the model; and second, it does not apply to non-probabilistic methods. So now 
we discuss some performance measures not based on likelihood. 

Intuitively, the goal of clustering is to assign points that are similar to the same cluster, 
and to ensure that points that are dissimilar are in different clusters. There are several ways 
of measuring these quantities (see e.g., Jain and Dubes 1988; Kaufman and Rousseeuw 1990). 
However, these internal criteria may be of limited use. An alternative is to rely on some external 
form of data with which to validate the method. For example, suppose we have labels for each 
object, as in Figure 25.1. (Equivalently, we can have a reference clustering; given a clustering, we 
can induce a set of labels and vice versa.) Then we can compare the clustering with the labels 
using various metrics which we describe below. We will use some of these metrics later, when 
we compare clustering methods. 


Purity 


Let $N_{ij}$ be the number of objects in cluster $i$ that belong to class $j$, and let $N_i = \sum_{j=1}^{C} N_{ij}$ be 
the total number of objects in cluster $i$. Define $p_{ij} = N_{ij}/N_i$; this is the empirical distribution 
over class labels for cluster $i$. We define the purity of a cluster as $p_i \triangleq \max_j p_{ij}$, and the 
overall purity of a clustering as 


$\text{purity} \triangleq \sum_i \frac{N_i}{N} p_i$ (25.5) 


For example, in Figure 25.1, we have that the purity is 

$\frac{6}{17} \frac{5}{6} + \frac{6}{17} \frac{4}{6} + \frac{5}{17} \frac{3}{5} = \frac{5 + 4 + 3}{17} = 0.71$ (25.6) 

The purity ranges between 0 (bad) and 1 (good). However, we can trivially achieve a purity of 
1 by putting each object into its own cluster, so this measure does not penalize for the number 
of clusters. 
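
The purity computation can be reproduced in a few lines. The cluster contents below are an assumption chosen to be consistent with the worked example (clusters of 6, 6 and 5 labeled objects); the function name is hypothetical.

```python
from collections import Counter

# Cluster contents assumed to match the worked example (6, 6 and 5 objects).
clusters = [list("AAAAAB"), list("ABBBBC"), list("AACCC")]

def purity(clusters):
    N = sum(len(c) for c in clusters)
    # N_i * p_i = max_j N_ij, so purity = (1/N) * sum_i max_j N_ij
    return sum(max(Counter(c).values()) for c in clusters) / N

print(round(purity(clusters), 2))

# Putting every object in its own cluster trivially gives purity 1.
singletons = [[label] for c in clusters for label in c]
print(purity(singletons))
```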


Rand index 


Let $U = \{u_1, \ldots, u_R\}$ and $V = \{v_1, \ldots, v_C\}$ be two different partitions of the $N$ data points, 
i.e., two different (flat) clusterings. For example, $U$ might be the estimated clustering and $V$ 
a reference clustering derived from the class labels. Now define a $2 \times 2$ contingency table, 
containing the following numbers: TP is the number of pairs that are in the same cluster in 
both $U$ and $V$ (true positives); TN is the number of pairs that are in different clusters in 
both $U$ and $V$ (true negatives); FN is the number of pairs that are in different clusters in 
$U$ but the same cluster in $V$ (false negatives); and FP is the number of pairs that are in the 
same cluster in $U$ but different clusters in $V$ (false positives). A common summary statistic is 
the Rand index: 


$R \triangleq \frac{TP + TN}{TP + FP + FN + TN}$ (25.7) 


This can be interpreted as the fraction of clustering decisions that are correct. Clearly $0 \leq R \leq 1$. 

For example, consider Figure 25.1. The three clusters contain 6, 6 and 5 points, so the number 
of “positives” (i.e., pairs of objects put in the same cluster, regardless of label) is 


$TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 40$ (25.8) 


Of these, the number of true positives is given by 


$TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 20$ (25.9) 


where the last two terms come from cluster 3: there are $\binom{3}{2}$ pairs labeled C and $\binom{2}{2}$ pairs 
labeled A. So $FP = 40 - 20 = 20$. Similarly, one can show $FN = 24$ and $TN = 72$. So the 
Rand index is $(20 + 72)/(20 + 20 + 24 + 72) = 0.68$. 
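
The pair counts in this example can be verified by brute force over all pairs of objects. The cluster contents are again an assumption consistent with the worked numbers, and the helper name is hypothetical.

```python
from itertools import combinations

clusters = [list("AAAAAB"), list("ABBBBC"), list("AACCC")]
# Flatten into (cluster id, class label) pairs, one per object.
objects = [(i, label) for i, c in enumerate(clusters) for label in c]

def pair_counts(objects):
    tp = fp = fn = tn = 0
    for (c1, l1), (c2, l2) in combinations(objects, 2):
        same_cluster, same_class = c1 == c2, l1 == l2
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

tp, fp, fn, tn = pair_counts(objects)
rand_index = (tp + tn) / (tp + fp + fn + tn)
print(tp, fp, fn, tn, round(rand_index, 2))
```

Enumerating pairs costs $O(N^2)$, which is fine for a check like this; for large $N$ one would compute the same counts from the cluster-by-class contingency table instead.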
The Rand index only achieves its lower bound of 0 if TP = TN = 0, which is a rare event. 
One can define an adjusted Rand index (Hubert and Arabie 1985) as follows: 


$AR \triangleq \frac{\text{index} - \text{expected index}}{\text{max index} - \text{expected index}}$ (25.10) 


Here the model of randomness is based on using the generalized hyper-geometric distribution, 
i.e. the two partitions are picked at random subject to having the original number of classes 
and objects in each, and then the expected value of TP + TN is computed. This model can 
be used to compute the statistical significance of the Rand index. 

The Rand index weights false positives and false negatives equally. Various other summary 
statistics for binary decision problems, such as the F-score (Section 5.7.2.2), can also be used. 
One can compute their frequentist sampling distribution, and hence their statistical significance, 
using methods such as bootstrap. 


Mutual information 


Another way to measure cluster quality is to compute the mutual information between $U$ and 
$V$ (Vaithyanathan and Dom 1999). To do this, let $p_{UV}(i, j) = \frac{|u_i \cap v_j|}{N}$ be the probability that 
a randomly chosen object belongs to cluster $u_i$ in $U$ and $v_j$ in $V$. Also, let $p_U(i) = |u_i|/N$ 
be the probability that a randomly chosen object belongs to cluster $u_i$ in $U$; define 
$p_V(j) = |v_j|/N$ similarly. Then we have 


$I(U, V) = \sum_{i=1}^{R} \sum_{j=1}^{C} p_{UV}(i, j) \log \frac{p_{UV}(i, j)}{p_U(i) p_V(j)}$ (25.11) 


This lies between 0 and $\min\{H(U), H(V)\}$. Unfortunately, the maximum value can be 
achieved by using lots of small clusters, which have low entropy. To compensate for this, 
we can use the normalized mutual information, 


$NMI(U, V) \triangleq \frac{I(U, V)}{(H(U) + H(V))/2}$ (25.12) 


This lies between 0 and 1. A version of this that is adjusted for chance (under a particular 
random data model) is described in (Vinh et al. 2009). Another variant, called variation of 
information, is described in (Meila 2005). 
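
Equations 25.11 and 25.12 translate directly into code once the two clusterings are represented as label lists. The sketch below assumes the same labeled clusters as in the Figure 25.1 example; the function names are hypothetical, and natural logarithms are used (the base cancels in the NMI ratio).

```python
import math
from collections import Counter

def entropy(labels):
    N = len(labels)
    return -sum((c / N) * math.log(c / N) for c in Counter(labels).values())

def mutual_information(u, v):
    N = len(u)
    nu, nv, nuv = Counter(u), Counter(v), Counter(zip(u, v))
    # sum over non-empty cells of p_UV log( p_UV / (p_U p_V) )
    return sum((n / N) * math.log((n / N) / ((nu[i] / N) * (nv[j] / N)))
               for (i, j), n in nuv.items())

def nmi(u, v):
    return mutual_information(u, v) / ((entropy(u) + entropy(v)) / 2)

clusters = [list("AAAAAB"), list("ABBBBC"), list("AACCC")]
u = [i for i, c in enumerate(clusters) for _ in c]    # estimated cluster ids
v = [label for c in clusters for label in c]          # reference class labels

print(mutual_information(u, v), nmi(u, v))
```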


Dirichlet process mixture models 


The simplest approach to (flat) clustering is to use a finite mixture model, as we discussed in 
Section 11.2.3. This is sometimes called model-based clustering, since we define a probabilistic 
model of the data, and optimize a well-defined objective (the likelihood or posterior), as opposed 
to just using some heuristic algorithm. 

The principal problem with finite mixture models is how to choose the number of components 
K. We discussed several techniques in Section 11.5. However, in many cases, there is no well- 
defined number of clusters. Even in the simple 2d height-weight data (Figure 1.8), it is not clear 
if the “correct” value of K should be 2, 3, or 4. It would be much better if we did not have to 
choose K at all. 

In this section, we discuss infinite mixture models, in which we do not impose any a priori 
bound on K. To do this, we will use a non-parametric prior based on the Dirichlet process 
(DP). This allows the number of clusters to grow as the amount of data increases. It will also 
prove useful later when we discuss hierarchical clustering. 

The topic of non-parametric Bayes is currently very active, and we do not have space to 
go into details (see (Hjort et al. 2010) for a recent book on the topic). Instead we just give a 
brief review of the DP and its application to mixture modeling, based on the presentation in 
(Sudderth 2006, sec 2.2). 


From finite to infinite mixture models 


Consider a finite mixture model, as shown in Figure 25.2(a). The usual representation is as 
follows: 


$p(x_i|z_i = k, \theta) = p(x_i|\theta_k)$ (25.13) 
$p(z_i = k|\pi) = \pi_k$ (25.14) 
$p(\pi|\alpha) = \mathrm{Dir}(\pi|(\alpha/K) 1_K)$ (25.15) 


The form of $p(\theta_k|\lambda)$ is chosen to be conjugate to $p(x_i|\theta_k)$. We can write $p(x_i|\theta_k)$ as 
$x_i \sim F(\theta_{z_i})$, where $F$ is the observation distribution. Similarly, we can write $\theta_k \sim H(\lambda)$, where $H$ 
is the prior. 


Figure 25.2 Two different representations of a finite mixture model. Left: traditional representation. 
Right: representation where parameters are samples from $G$, a discrete measure. The picture on the right 
illustrates the case where $K = 4$, and we sample 4 Gaussian means $\theta_k$ from a Gaussian prior $H(\cdot|\lambda)$. The 
height of the spikes reflects the mixing weights $\pi_k$. This weighted sum of delta functions is $G$. We then 
generate two parameters, $\bar{\theta}_1$ and $\bar{\theta}_2$, from $G$, one per data point. Finally, we generate two data points, 
$x_1$ and $x_2$, from $\mathcal{N}(\bar{\theta}_1, \sigma^2)$ and $\mathcal{N}(\bar{\theta}_2, \sigma^2)$. Source: Figure 2.9 of (Sudderth 2006). Used with kind 
permission of Erik Sudderth. 

An equivalent representation for this model is shown in Figure 25.2(b). Here $\bar{\theta}_i$ is the 
parameter used to generate observation $x_i$; these parameters are sampled from distribution $G$, 
which has the form 


$G(\theta) = \sum_{k=1}^{K} \pi_k \delta_{\theta_k}(\theta)$ (25.16) 


where $\pi \sim \mathrm{Dir}(\frac{\alpha}{K} 1)$, and $\theta_k \sim H$. Thus we see that $G$ is a finite mixture of delta functions, 
centered on the cluster parameters $\theta_k$. The probability that $\bar{\theta}_i$ is equal to $\theta_k$ is exactly $\pi_k$, the 
prior probability for that cluster. 

If we sample from this model, we will always (with probability one) get exactly K clusters, 
with data points scattered around the cluster centers. We would like a more flexible model, 
that can generate a variable number of clusters. Furthermore, the more data we generate, the 
more likely we should be to see a new cluster. The way to do this is to replace the discrete 
distribution G with a random probability measure. Below we will show that the Dirichlet 
process, denoted G ~ DP(a, H), is one way to do this. 

Before we go into the details, we show some samples from this non-parametric model in 
Figure 25.3. We see that it has the desired properties of generating a variable number of clusters, 
with more clusters as the amount of data increases. The resulting samples look much more like 
real data than samples from a finite mixture model. 

Of course, working with an “infinite” model sounds scary. Fortunately, as we show below, 
even though this model is potentially infinite, we can perform inference using an amount of 
computation that is not only tractable, but is often much less than that required to fit a set 


Figure 25.3 Some samples from a Dirichlet process mixture model of 2D Gaussians, with concentration 
parameter $\alpha = 1$. From left to right, we show $N = 50$, $N = 500$ and $N = 1000$ samples. Each row is a 
different run. We also show the model parameters as ellipses, which are sampled from a vague NIW base 
distribution. Based on Figure 2.25 of (Sudderth 2006). Figure generated by dpmSampleDemo, written by 
Yee-Whye Teh. 


of finite mixture models for different K. The intuitive reason is that we can get evidence that 
certain values of K are appropriate (have high posterior support) long before we have been able 
to estimate the parameters, so we can focus our computational efforts on models of appropriate 
complexity. Thus going to the infinite limit can sometimes be faster. This is especially true 
when we have multiple model selection problems to solve. 


Figure 25.4 (a) A base measure $H$ on a 2d space $\Theta$. (b) One possible partition into $K = 3$ regions, 
where the shading of cell $T_k$ is proportional to $E[G(T_k)] = H(T_k)$. (c) A refined partition into $K = 5$ 
regions. Source: Figure 2.21 of (Sudderth 2006). Used with kind permission of Erik Sudderth. 


The Dirichlet process 


Recall from Chapter 15 that a Gaussian process is a distribution over functions of the form 
$f : \mathcal{X} \to \mathbb{R}$. It is defined implicitly by the requirement that $p(f(x_1), \ldots, f(x_N))$ be jointly 
Gaussian, for any set of points $x_i \in \mathcal{X}$. The parameters of this Gaussian can be computed using 
a mean function $\mu()$ and covariance (kernel) function $K()$. We write $f \sim \mathrm{GP}(\mu(), K())$. Furthermore, 
the GP is consistently defined, so that $p(f(x_1))$ can be derived from $p(f(x_1), f(x_2))$, 
etc. 

A Dirichlet process is a distribution over probability measures $G : \Theta \to \mathbb{R}^+$, where we 
require $G(\theta) \geq 0$ and $\int_\Theta G(\theta) d\theta = 1$. The DP is defined implicitly by the requirement that 
$(G(T_1), \ldots, G(T_K))$ has a joint Dirichlet distribution 


$\mathrm{Dir}(\alpha H(T_1), \ldots, \alpha H(T_K))$ (25.17) 


for any finite partition $(T_1, \ldots, T_K)$ of $\Theta$. If this is the case, we write $G \sim \mathrm{DP}(\alpha, H)$, where 
$\alpha$ is called the concentration parameter and $H$ is called the base measure.¹ 

An example of a DP is shown in Figure 25.4, where the base measure is a 2d Gaussian. The 
distribution over all the cells, $p(G(T_1), \ldots, G(T_K))$, is Dirichlet, so the marginals in each cell 
are beta distributed: 


$\mathrm{Beta}(\alpha H(T_i), \alpha \sum_{j \neq i} H(T_j))$ (25.18) 

The DP is consistently defined in the sense that if $T_1$ and $T_2$ form a partition of $\tilde{T}_1$, then 
$G(T_1) + G(T_2)$ and $G(\tilde{T}_1)$ both follow the same beta distribution. 


Recall that if $\pi \sim \mathrm{Dir}(\alpha)$, and $z|\pi \sim \mathrm{Cat}(\pi)$, then we can integrate out $\pi$ to get the 
predictive distribution for the Dirichlet-multinoulli model: 


$z \sim \mathrm{Cat}(\alpha_1/\alpha_0, \ldots, \alpha_K/\alpha_0)$ (25.19) 


1. Unlike a GP, knowing something about $G(T_k)$ does not tell us anything about $G(T_{k'})$, beyond the sum-to-one 
constraint; we say that the DP is a neutral process. Other stochastic processes can be defined that do not have this 
property, but they are not so computationally convenient. 


Figure 25.5 Illustration of the stick breaking construction. (a) We have a unit length stick, which we 
break at a random point $\beta_1$; the length of the piece we keep is called $\pi_1$; we then recursively break off 
pieces of the remaining stick, to generate $\pi_2, \pi_3, \ldots$. Source: Figure 2.22 of (Sudderth 2006). Used with 
kind permission of Erik Sudderth. (b) Samples of $\pi_k$ from this process for $\alpha = 2$ (top row) and $\alpha = 5$ 
(bottom row). Figure generated by stickBreakingDemo, written by Yee-Whye Teh. 


where $\alpha_0 = \sum_k \alpha_k$. In other words, $p(z = k|\alpha) = \alpha_k/\alpha_0$. Also, the updated posterior for $\pi$ 
given one observation is given by 


$\pi|z \sim \mathrm{Dir}(\alpha_1 + I(z = 1), \ldots, \alpha_K + I(z = K))$ (25.20) 


The DP generalizes this to arbitrary partitions. If $G \sim \mathrm{DP}(\alpha, H)$, then $p(\theta \in T_i) = H(T_i)$ and 
the posterior is 


$p(G(T_1), \ldots, G(T_K)|\theta, \alpha, H) = \mathrm{Dir}(\alpha H(T_1) + I(\theta \in T_1), \ldots, \alpha H(T_K) + I(\theta \in T_K))$ (25.21) 


This holds for any set of partitions. Hence if we observe multiple samples $\bar{\theta}_i \sim G$, the new 
posterior is given by 


$G|\bar{\theta}_1, \ldots, \bar{\theta}_N, \alpha, H \sim \mathrm{DP}\left(\alpha + N, \frac{1}{\alpha + N}\left(\alpha H + \sum_{i=1}^{N} \delta_{\bar{\theta}_i}\right)\right)$ (25.22) 


Thus we see that the DP effectively defines a conjugate prior for arbitrary measurable spaces. 
The concentration parameter $\alpha$ is like the effective sample size of the base measure $H$. 


Stick breaking construction of the DP 


Our discussion so far has been very abstract. We now give a constructive definition for the DP, 
known as the stick-breaking construction. 
Let $\pi = \{\pi_k\}_{k=1}^{\infty}$ be an infinite sequence of mixture weights derived from the following 
process: 


$\beta_k \sim \mathrm{Beta}(1, \alpha)$ (25.23) 
$\pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l) = \beta_k \left(1 - \sum_{l=1}^{k-1} \pi_l\right)$ (25.24) 


This is often denoted by 

$\pi \sim \mathrm{GEM}(\alpha)$ (25.25) 


where GEM stands for Griffiths, Engen and McCloskey (this term is due to (Ewens 1990)). Some 
samples from this process are shown in Figure 25.5. One can show that this process 
will terminate with probability 1, although the number of elements it generates increases with 
$\alpha$. Furthermore, the size of the $\pi_k$ components decreases on average. 

Now define 


$G(\theta) = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}(\theta)$ (25.26) 


where $\pi \sim \mathrm{GEM}(\alpha)$ and $\theta_k \sim H$. Then one can show that $G \sim \mathrm{DP}(\alpha, H)$. 

As a consequence of this construction, we see that samples from a DP are discrete with 
probability one. In other words, if you keep sampling it, you will get more and more repetitions 
of previously generated values. So if we sample $\bar{\theta}_i \sim G$, we will see repeated values; let us 
number the unique values $\theta_1$, $\theta_2$, etc. Data sampled from $\bar{\theta}_i$ will therefore cluster around the 
$\theta_k$. This is evident in Figure 25.3, where most data comes from the Gaussians with large $\pi_k$ 
values, represented by ellipses with thick borders. This is our first indication that the DP might 
be useful for clustering. 
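
A truncated version of the stick-breaking construction is easy to simulate. The sketch below is an approximation under stated assumptions: the number of sticks is capped at a finite value, the base measure $H$ is taken to be a standard Gaussian, and the function name is hypothetical.

```python
import numpy as np

def stick_breaking(alpha, n_sticks, rng):
    beta = rng.beta(1.0, alpha, size=n_sticks)              # beta_k ~ Beta(1, alpha)
    # length of stick remaining before break k: prod_{l<k} (1 - beta_l)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
    return beta * remaining                                  # pi_k

rng = np.random.default_rng(0)
pi = stick_breaking(alpha=2.0, n_sticks=1000, rng=rng)
theta = rng.normal(0.0, 1.0, size=pi.size)   # atoms theta_k ~ H, with H = N(0, 1) as an assumption

# Draws from the truncated G = sum_k pi_k delta_{theta_k}: repeated values appear.
draws = rng.choice(theta, size=20, p=pi / pi.sum())
print(pi[:5], pi.sum(), len(np.unique(draws)))
```

With enough sticks the leftover mass is negligible, so renormalizing the truncated weights before sampling changes them very little.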


The Chinese restaurant process (CRP) 


Working with infinite dimensional sticks is problematic. However, we can exploit the clustering 
property to draw samples from a DP, as we now show. 

The key result is this: If $\bar{\theta}_i \sim G$ are $N$ observations from $G \sim \mathrm{DP}(\alpha, H)$, taking on $K$ 
distinct values $\theta_k$, then the predictive distribution of the next observation is given by 


$p(\bar{\theta}_{N+1} = \theta|\bar{\theta}_{1:N}, \alpha, H) = \frac{1}{\alpha + N}\left(\alpha H(\theta) + \sum_{k=1}^{K} N_k \delta_{\theta_k}(\theta)\right)$ (25.27) 


where $N_k$ is the number of previous observations equal to $\theta_k$. This is called the Polya urn or 
Blackwell-MacQueen sampling scheme. This provides a constructive way to sample from a DP. 

It is much more convenient to work with discrete variables $z_i$ which specify which value of 
$\theta_k$ to use. That is, we define $\bar{\theta}_i = \theta_{z_i}$. Based on the above expression, we have 


$p(z_{N+1} = z|z_{1:N}, \alpha) = \frac{1}{\alpha + N}\left(\alpha I(z = k^*) + \sum_{k=1}^{K} N_k I(z = k)\right)$ (25.28) 


where $k^*$ represents a new cluster index that has not yet been used. This is called the Chinese 
restaurant process or CRP, based on the seemingly infinite supply of tables at certain Chinese 
restaurants. The analogy is as follows: The tables are like clusters, and the customers are like 
observations. When a person enters the restaurant, he may choose to join an existing table with 
probability proportional to the number of people already sitting at this table (the $N_k$); otherwise, 
with a probability that diminishes as more people enter the room (due to the $1/(\alpha + N)$ term), 


Figure 25.6 Two views of a DP mixture model. Left: infinite number of cluster parameters, $\theta_k$, and 
$\pi \sim \mathrm{GEM}(\alpha)$. Right: $G$ is drawn from a DP. Compare to Figure 25.2. Source: Figure 2.24 of (Sudderth 
2006). Used with kind permission of Erik Sudderth. 


he may choose to sit at a new table k*. The result is a distribution over partitions of the 
integers, which is like a distribution of customers to tables. 

The fact that currently occupied tables are more likely to get new customers is sometimes 
called the rich get richer phenomenon. Indeed, one can derive an expression for the distribution 
of cluster sizes induced by this prior process; it is basically a power law. The number 
of occupied tables $K$ almost surely approaches $\alpha \log(N)$ as $N \to \infty$, showing that the model 
complexity will indeed grow logarithmically with dataset size. More flexible priors over cluster 
sizes can also be defined, such as the two-parameter Pitman-Yor process. 
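
The seating rule of the CRP (Equation 25.28) can be simulated directly, which makes the rich get richer effect and the slow growth of $K$ easy to inspect. This is a minimal sketch; the function name and the settings N = 1000, $\alpha$ = 2 are assumptions for the example.

```python
import random

def sample_crp(N, alpha, rng):
    counts = []     # counts[k] = N_k, the number of customers at table k
    z = []          # z[i] = table chosen by customer i
    for _ in range(N):
        # Existing table k has weight N_k, a new table has weight alpha;
        # the common normalizer 1 / (alpha + number seated) is handled by choices().
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)      # open a new table k*
        else:
            counts[k] += 1
        z.append(k)
    return z, counts

rng = random.Random(0)
z, counts = sample_crp(N=1000, alpha=2.0, rng=rng)
print(len(counts), sorted(counts, reverse=True)[:5])
```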


Applying Dirichlet processes to mixture modeling 


The DP is not particularly useful as a model for data directly, since data vectors rarely repeat 
exactly. However, it is useful as a prior for the parameters of a stochastic data generating 
mechanism, such as a mixture model. To create such a model, we follow exactly the same setup 
as Section 11.2, but we define $G \sim \mathrm{DP}(\alpha, H)$. Equivalently, we can write the model as follows: 


$\pi \sim \mathrm{GEM}(\alpha)$ (25.29) 
$z_i \sim \pi$ (25.30) 
$\theta_k \sim H(\lambda)$ (25.31) 
$x_i \sim F(\theta_{z_i})$ (25.32) 


This is illustrated in Figure 25.6. We see that $G$ is now a random draw of an unbounded number 
of parameters $\theta_k$ from the base distribution $H$, each with weight $\pi_k$. Each data point $x_i$ is 
generated by sampling its own “private” parameter $\bar{\theta}_i$ from $G$. As we get more and more data, 
it becomes increasingly likely that $\bar{\theta}_i$ will be equal to one of the $\theta_k$'s we have seen before, and 
thus $x_i$ will be generated close to an existing datapoint. 


Fitting a DP mixture model 


The simplest way to fit a DPMM is to modify the collapsed Gibbs sampler of Section 24.2.4. 
From Equation 24.23 we have 


plzi = k\z_i,x,a,A) œ plz; = klz_;,a)p(xi|x_-i, zi = k, zi, A) (25.33) 
By exchangeability, we can assume that z; is the last customer to enter the restaurant. Hence 
the first term is given by 

K 


p( zi |Z—1, a) = PE (a = k*) + 5 Ng —il(zi = v) (25.34) 


a+N-1 E 


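where $K$ is the number of clusters used by $\mathbf{z}_{-i}$, and $k^*$ is a new cluster. Another way to write
this is as follows:

$$p(z_i = k \mid \mathbf{z}_{-i}, \alpha) = \begin{cases} \dfrac{N_{k,-i}}{\alpha + N - 1} & \text{if } k \text{ has been seen before} \\[2mm] \dfrac{\alpha}{\alpha + N - 1} & \text{if } k \text{ is a new cluster} \end{cases} \tag{25.35}$$

Interestingly, this is equivalent to Equation 24.26, which has the form
$p(z_i = k \mid \mathbf{z}_{-i}, \alpha) = \frac{N_{k,-i} + \alpha/K}{\alpha + N - 1}$, in the $K \to \infty$ limit (Rasmussen 2000; Neal 2000).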

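To compute the second term, $p(\mathbf{x}_i \mid \mathbf{x}_{-i}, z_i = k, \mathbf{z}_{-i}, \boldsymbol{\lambda})$, let us partition the data $\mathbf{x}_{-i}$ into
clusters based on $\mathbf{z}_{-i}$. Let $\mathbf{x}_{-i,c} = \{\mathbf{x}_j : z_j = c, j \neq i\}$ be the data assigned to cluster $c$. If
$z_i = k$, then $\mathbf{x}_i$ is conditionally independent of all the data points except those assigned to
cluster $k$. Hence we have

$$p(\mathbf{x}_i \mid \mathbf{x}_{-i}, \mathbf{z}_{-i}, z_i = k, \boldsymbol{\lambda}) = p(\mathbf{x}_i \mid \mathbf{x}_{-i,k}, \boldsymbol{\lambda}) = \frac{p(\mathbf{x}_i, \mathbf{x}_{-i,k} \mid \boldsymbol{\lambda})}{p(\mathbf{x}_{-i,k} \mid \boldsymbol{\lambda})} \tag{25.36}$$

where

$$p(\mathbf{x}_i, \mathbf{x}_{-i,k} \mid \boldsymbol{\lambda}) = \int p(\mathbf{x}_i \mid \boldsymbol{\theta}_k) \left[ \prod_{j \neq i : z_j = k} p(\mathbf{x}_j \mid \boldsymbol{\theta}_k) \right] H(\boldsymbol{\theta}_k \mid \boldsymbol{\lambda})\, d\boldsymbol{\theta}_k \tag{25.37}$$

is the marginal likelihood of all the data assigned to cluster $k$, including $i$, and $p(\mathbf{x}_{-i,k} \mid \boldsymbol{\lambda})$ is an
analogous expression excluding $i$. Thus we see that the term $p(\mathbf{x}_i \mid \mathbf{x}_{-i}, \mathbf{z}_{-i}, z_i = k, \boldsymbol{\lambda})$ is the
posterior predictive distribution for cluster $k$ evaluated at $\mathbf{x}_i$.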

If $z_i = k^*$, corresponding to a new cluster, we have

$$p(\mathbf{x}_i \mid \mathbf{x}_{-i}, \mathbf{z}_{-i}, z_i = k^*, \boldsymbol{\lambda}) = p(\mathbf{x}_i \mid \boldsymbol{\lambda}) = \int p(\mathbf{x}_i \mid \boldsymbol{\theta})\, H(\boldsymbol{\theta} \mid \boldsymbol{\lambda})\, d\boldsymbol{\theta} \tag{25.38}$$

which is just the prior predictive distribution for a new cluster evaluated at $\mathbf{x}_i$.
See Algorithm 25.1 for the pseudocode. (This is called “Algorithm 3” in (Neal 2000).) This is very
similar to collapsed Gibbs for finite mixtures except that we have to consider the case $z_i = k^*$.
An example of this procedure in action is shown in Figure 25.7. The sample clusterings, and
the induced posterior over $K$, seem reasonable. The method tends to rapidly discover a good
clustering. By contrast, Gibbs sampling (and EM) for a finite mixture model often gets stuck in
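Algorithm 25.1: Collapsed Gibbs sampler for DP mixtures 


for each $i = 1 : N$ in random order do
    Remove $\mathbf{x}_i$'s sufficient statistics from old cluster $z_i$;
    for each $k = 1 : K$ do
        Compute $p_k(\mathbf{x}_i) = p(\mathbf{x}_i \mid \mathbf{x}_{-i}(k))$;
        Set $N_{k,-i} = \dim(\mathbf{x}_{-i}(k))$, the number of points currently in cluster $k$;
        Compute $p(z_i = k \mid \mathbf{z}_{-i}, \mathcal{D}) \propto \frac{N_{k,-i}}{\alpha + N - 1}\, p_k(\mathbf{x}_i)$;
    Compute $p_*(\mathbf{x}_i) = p(\mathbf{x}_i \mid \boldsymbol{\lambda})$;
    Compute $p(z_i = * \mid \mathbf{z}_{-i}, \mathcal{D}) \propto \frac{\alpha}{\alpha + N - 1}\, p_*(\mathbf{x}_i)$;
    Normalize $p(z_i \mid \cdot)$;
    Sample $z_i \sim p(z_i \mid \cdot)$;
    Add $\mathbf{x}_i$'s sufficient statistics to new cluster $z_i$;
    If any cluster is empty, remove it and decrease $K$;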


poor local optima (not shown). This is because the DPMM is able to create extra redundant 
clusters early on, and to use them to escape local optima. Figure 25.8 shows that most of the 
time, the DPMM converges more rapidly than a finite mixture model. 
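
The collapsed Gibbs sampler can be sketched in a few lines for a simple conjugate model. The code below is a minimal illustration, not the book's dpmGauss2dDemo: it assumes 1d data, a Gaussian likelihood with known variance `sigma_sq`, and a $\mathcal{N}(\mu_0, \tau_0^2)$ prior on each cluster mean, so that the predictive distributions in Equations 25.36 and 25.38 are Gaussian. All function names and hyper-parameter values are hypothetical. The common factor $1/(\alpha + N - 1)$ is dropped because the probabilities are normalized anyway.

```python
import math
import random

def log_predictive(x, n, s, mu0, tau0_sq, sigma_sq):
    """log p(x | cluster with n points summing to s); n = 0 gives the prior predictive."""
    precision = 1.0 / tau0_sq + n / sigma_sq
    mean = (mu0 / tau0_sq + s / sigma_sq) / precision
    var = 1.0 / precision + sigma_sq
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def gibbs_sweep(x, z, counts, sums, alpha, rng, mu0=0.0, tau0_sq=100.0, sigma_sq=1.0):
    """One pass over the data; counts and sums are dicts keyed by cluster label."""
    order = list(range(len(x)))
    rng.shuffle(order)
    for i in order:
        # Remove x_i's sufficient statistics from its old cluster.
        k_old = z[i]
        counts[k_old] -= 1
        sums[k_old] -= x[i]
        if counts[k_old] == 0:
            del counts[k_old], sums[k_old]
        # Unnormalized log-probabilities: N_{k,-i} * p_k(x_i) for existing clusters.
        labels = list(counts)
        log_w = [math.log(counts[k]) + log_predictive(x[i], counts[k], sums[k], mu0, tau0_sq, sigma_sq)
                 for k in labels]
        # New cluster: alpha * prior predictive.
        labels.append(max(labels, default=-1) + 1)
        log_w.append(math.log(alpha) + log_predictive(x[i], 0, 0.0, mu0, tau0_sq, sigma_sq))
        # Normalize (implicitly, via relative weights) and sample.
        m = max(log_w)
        k_new = rng.choices(labels, weights=[math.exp(lw - m) for lw in log_w], k=1)[0]
        # Add x_i's sufficient statistics to its new cluster.
        z[i] = k_new
        counts[k_new] = counts.get(k_new, 0) + 1
        sums[k_new] = sums.get(k_new, 0.0) + x[i]
    return z

rng = random.Random(0)
x = [rng.gauss(-5.0, 1.0) for _ in range(30)] + [rng.gauss(5.0, 1.0) for _ in range(30)]
z = [0] * len(x)                       # start with everything in one cluster
counts, sums = {0: len(x)}, {0: sum(x)}
for _ in range(50):
    gibbs_sweep(x, z, counts, sums, alpha=1.0, rng=rng)
print("clusters in use after 50 sweeps:", len(counts))
```

Empty clusters are deleted as soon as their count reaches zero, which corresponds to the last step of Algorithm 25.1.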

A variety of other fitting methods have been proposed. (Daume 2007a) shows how one can use
A* search and beam search to quickly find an approximate MAP estimate. (Mansinghka et al.
2007) discusses how to fit a DPMM online using particle filtering, which is like a stochastic
version of beam search. This can be more efficient than Gibbs sampling, particularly for large
datasets. (Kurihara et al. 2006) develops a variational approximation that is even faster (see also
(Zobay 2009)). Extensions to the case of non-conjugate priors are discussed in (Neal 2000).

Another important issue is how to set the hyper-parameters. For the DP, the value of $\alpha$
does not have much impact on predictive accuracy, but it does affect the number of clusters.
One approach is to put a $\mathrm{Ga}(a, b)$ prior on $\alpha$, and then to sample from its posterior, $p(\alpha \mid K, N, a, b)$,
using auxiliary variable methods (Escobar and West 1995). Alternatively, one can use empirical
Bayes (McAuliffe et al. 2006). Similarly, for the base distribution, we can either sample the
hyper-parameters $\boldsymbol{\lambda}$ (Rasmussen 2000) or use empirical Bayes (McAuliffe et al. 2006).


Affinity propagation 


Mixture models, whether finite or infinite, require access to the raw $N \times D$ data matrix, and
require us to specify a generative model of the data. An alternative approach takes as input an $N \times N$
similarity matrix, and then tries to identify exemplars, which will act as cluster centers. The
K-medoids or K-centers algorithm (Section 14.4.2) is one approach, but it can suffer from local
minima. Here we describe an alternative approach called affinity propagation (Frey and Dueck
2007) that works substantially better in practice.

The idea is that each data point must choose another data point as its exemplar or centroid;
some data points will choose themselves as centroids, and this will automatically determine the
number of clusters. More precisely, let $c_i \in \{1, \ldots, N\}$ represent the centroid for datapoint $i$.
Figure 25.7 100 data points in 2d are clustered using a DP mixture fit with collapsed Gibbs sampling.
We show samples from the posterior after 50, 100, and 200 samples. We also show the posterior over $K$, based
on 200 samples, discarding the first 50 as burn-in. Figure generated by dpmGauss2dDemo, written by Yee
Whye Teh.


The goal is to maximize the following function

$$S(\mathbf{c}) = \sum_{i=1}^{N} s(i, c_i) + \sum_{k=1}^{N} \delta_k(\mathbf{c}) \tag{25.39}$$

The first term measures the similarity of each point to its centroid. The second term is a penalty
term that is $-\infty$ if some data point $i$ has chosen $k$ as its exemplar (i.e., $c_i = k$), but $k$ has not
chosen itself as an exemplar (i.e., we do not have $c_k = k$). More formally,

$$\delta_k(\mathbf{c}) = \begin{cases} -\infty & \text{if } c_k \neq k \text{ but } \exists i : c_i = k \\ 0 & \text{otherwise} \end{cases} \tag{25.40}$$
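
To make the objective concrete, the following sketch evaluates $S(\mathbf{c})$ from Equations 25.39 and 25.40 for a candidate assignment. It only scores a given assignment; it does not perform the message passing described next. The function name, the 0-based indexing, the toy data, and the diagonal "preference" value of $-3$ are illustrative assumptions.

```python
import math

def ap_objective(similarity, c):
    """S(c) = sum_i s(i, c_i) + sum_k delta_k(c), using 0-based indices."""
    # delta_k(c) is -inf if some point chose k as exemplar but k did not choose itself.
    for k in set(c):
        if c[k] != k:
            return -math.inf
    return sum(similarity[i][c[i]] for i in range(len(c)))

# Toy 1d data: similarity is the negative absolute distance; the diagonal
# s(i, i) is a hypothetical preference for being an exemplar.
points = [0.0, 1.0, 10.0, 11.0]
S = [[-3.0 if i == j else -abs(a - b) for j, b in enumerate(points)]
     for i, a in enumerate(points)]

print(ap_objective(S, [0, 0, 2, 2]))   # two exemplars (points 0 and 2)
print(ap_objective(S, [0, 0, 0, 0]))   # one exemplar for everything
print(ap_objective(S, [1, 0, 2, 2]))   # inconsistent: the penalty applies
```

In this toy example the two-exemplar assignment scores higher than the single-exemplar one, and the inconsistent assignment is ruled out by the penalty term.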


The objective function can be represented as a factor graph. We can either use N nodes, 
Figure 25.8 Comparison of collapsed Gibbs samplers for a DP mixture (dark blue) and a finite mixture 
(light red) with K = 4 applied to N = 300 data points (shown in Figure 25.7). Left: logprob vs iteration 
for 20 different starting values. Right: median (thick line) and quantiles (dashed lines) over 100 different 
starting values. Source: Figure 2.27 of (Sudderth 2006). Used with kind permission of Erik Sudderth. 


Figure 25.9 Factor graphs for affinity propagation. Circles are variables, squares are factors. Each $c_i$ node
has $N$ possible states. From Figure S2 of (Frey and Dueck 2007). Used with kind permission of Brendan
Frey.


each with $N$ possible values, as shown in Figure 25.9, or we can use $N^2$ binary nodes (see
(Givoni and Frey 2009) for the details). We will assume the former representation.

We can find a strong local maximum of the objective by using max-product loopy belief
propagation (Section 22.2). Referring to the model in Figure 25.9, each variable node $c_i$ sends
a message to each factor node $\delta_k$. It turns out that this vector of $N$ numbers can be reduced
to a scalar message, denoted $r_{i \to k}$, known as the responsibility. This is a measure of how much
$i$ thinks $k$ would make a good exemplar, compared to all the other exemplars $i$ has looked at.
In addition, each factor node $\delta_k$ sends a message to each variable node $c_i$. Again this can be
reduced to a scalar message, $a_{i \leftarrow k}$, known as the availability. This is a measure of how strongly
$k$ believes it should be an exemplar for $i$, based on all the other data points $k$ has looked at.

As usual with loopy BP, the method might oscillate, and convergence is not guaranteed. 
Figure 25.10 Example of affinity propagation. Each point is color coded by how much it wants to be
an exemplar (red is the most, green is the least). This can be computed by summing up all the incoming
availability messages and the self-similarity term. The darkness of the $i \to k$ arrow reflects how much
point $i$ wants to belong to exemplar $k$. From Figure 1 of (Frey and Dueck 2007). Used with kind permission
of Brendan Frey.


However, by using damping, the method is very reliable in practice. If the graph is densely
connected, message passing takes $O(N^2)$ time, but with sparse similarity matrices, it only takes
$O(E)$ time, where $E$ is the number of edges or non-zero entries in $S$.

The number of clusters can be controlled by scaling the diagonal terms $S(i, i)$, which reflect
how much each data point wants to be an exemplar. Figure 25.10 gives a simple example of some
2d data, where the negative Euclidean distance was used to measure similarity. The $S(i, i)$
values were set to be the median of all the pairwise similarities. The result is 3 clusters. Many
other results are reported in (Frey and Dueck 2007), who show that the method significantly
outperforms K-medoids.


Spectral clustering 


An alternative view of clustering is in terms of graph cuts. The idea is we create a weighted 
undirected graph W from the similarity matrix S, typically by using the nearest neighbors of 
each point; this ensures the graph is sparse, which speeds computation. If we want to find a 
partition into $K$ clusters, say $A_1, \ldots, A_K$, one natural criterion is to minimize

$$\mathrm{cut}(A_1, \ldots, A_K) \triangleq \frac{1}{2} \sum_{k=1}^{K} W(A_k, \bar{A}_k) \tag{25.41}$$

where $\bar{A}_k = V \setminus A_k$ is the complement of $A_k$, and $W(A, B) \triangleq \sum_{i \in A, j \in B} w_{ij}$. For $K = 2$ this
problem is easy to solve. Unfortunately the optimal solution often just partitions off a single
data point from the rest. To ensure the sets are reasonably large, we can define the normalized
cut to be

$$\mathrm{Ncut}(A_1, \ldots, A_K) \triangleq \frac{1}{2} \sum_{k=1}^{K} \frac{\mathrm{cut}(A_k, \bar{A}_k)}{\mathrm{vol}(A_k)} \tag{25.42}$$

where $\mathrm{vol}(A) \triangleq \sum_{i \in A} d_i$ and $d_i = \sum_{j=1}^{N} w_{ij}$ is the weighted degree of node $i$. This splits
the graph into $K$ clusters such that nodes within each cluster are similar to each other, but are
different to nodes in other clusters.

We can formulate the Ncut problem in terms of searching for binary vectors $\mathbf{c}_i \in \{0, 1\}^N$,
where $c_{ik} = 1$ if point $i$ belongs to cluster $k$, that minimize the objective. Unfortunately this
is NP-hard (Wagner and Wagner 1993). Affinity propagation is one way to solve the problem.
Another is to relax the constraints that $\mathbf{c}_i$ be binary, and allow them to be real-valued. The
result turns into an eigenvector problem known as spectral clustering (see e.g., (Shi and Malik
2000)). In general, the technique of performing eigenanalysis of graphs is called spectral graph
theory (Chung 1997).

Going into the details would take us too far afield, but below we give a very brief summary, 
based on (von Luxburg 2007), since we will encounter some of these ideas later on. 


Graph Laplacian 


Let $W$ be a symmetric weight matrix for a graph, where $w_{ij} = w_{ji} \geq 0$. Let $D = \mathrm{diag}(d_i)$ be a
diagonal matrix containing the weighted degree of each node. We define the graph Laplacian
as follows:

$$L \triangleq D - W \tag{25.43}$$


This matrix has various important properties. Because each row sums to zero, we have 
that 1 is an eigenvector with eigenvalue 0. Furthermore, the matrix is symmetric and positive 
semi-definite. To see this, note that 


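$$\mathbf{f}^T L \mathbf{f} = \mathbf{f}^T D \mathbf{f} - \mathbf{f}^T W \mathbf{f} = \sum_i d_i f_i^2 - \sum_{i,j} f_i f_j w_{ij} \tag{25.44}$$

$$= \frac{1}{2} \left( \sum_i d_i f_i^2 - 2 \sum_{i,j} f_i f_j w_{ij} + \sum_j d_j f_j^2 \right) = \frac{1}{2} \sum_{i,j} w_{ij} (f_i - f_j)^2 \tag{25.45}$$

Hence $\mathbf{f}^T L \mathbf{f} \geq 0$ for all $\mathbf{f} \in \mathbb{R}^N$. Consequently we see that $L$ has $N$ non-negative, real-valued
eigenvalues, $0 \leq \lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_N$.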
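
The identity in Equations 25.44 and 25.45 is easy to check numerically. The sketch below builds $L = D - W$ for a small hand-made weight matrix and compares the quadratic form with the pairwise sum. The matrix, the test vector, and the function names are arbitrary illustrative choices.

```python
def laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix."""
    n = len(W)
    degrees = [sum(row) for row in W]
    return [[(degrees[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def quad_form(L, f):
    """f^T L f."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))

def pairwise_form(W, f):
    """(1/2) sum_{ij} w_ij (f_i - f_j)^2."""
    n = len(f)
    return 0.5 * sum(W[i][j] * (f[i] - f[j]) ** 2 for i in range(n) for j in range(n))

W = [[0.0, 1.0, 2.0, 0.0],
     [1.0, 0.0, 0.0, 3.0],
     [2.0, 0.0, 0.0, 1.0],
     [0.0, 3.0, 1.0, 0.0]]
f = [1.0, -2.0, 0.5, 3.0]
L = laplacian(W)
print(quad_form(L, f), pairwise_form(W, f))   # the two values should agree
print([sum(row) for row in L])                # each row of L sums to zero
```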

To get some intuition as to why L might be useful for graph-based clustering, we note the 
following result. 


Theorem 25.4.1. The set of eigenvectors of $L$ with eigenvalue 0 is spanned by the indicator vectors
$\mathbf{1}_{A_1}, \ldots, \mathbf{1}_{A_K}$, where $A_k$ are the $K$ connected components of the graph.

Proof. Let us start with the case $K = 1$. If $\mathbf{f}$ is an eigenvector with eigenvalue 0, then
$0 = \sum_{ij} w_{ij} (f_i - f_j)^2$. If two nodes are connected, so $w_{ij} > 0$, we must have that $f_i = f_j$.
Hence $\mathbf{f}$ is constant for all vertices which are connected by a path in the graph. Now suppose
$K > 1$. In this case, $L$ will be block diagonal. A similar argument to the above shows that we
will have $K$ indicator functions, which “select out” the connected components.
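
As a small sanity check of the theorem, the sketch below builds a graph with two connected components and verifies that each component's indicator vector is mapped to zero by $L$, whereas a vector that is not constant on a component is not. The graph and the helper names are illustrative assumptions.

```python
def laplacian(W):
    """Unnormalized graph Laplacian L = D - W."""
    n = len(W)
    degrees = [sum(row) for row in W]
    return [[(degrees[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def matvec(L, f):
    """Matrix-vector product L f."""
    return [sum(L[i][j] * f[j] for j in range(len(f))) for i in range(len(L))]

# Two connected components: nodes {0, 1, 2} and nodes {3, 4}.
W = [[0.0, 1.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 2.0, 0.0, 0.0],
     [1.0, 2.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 3.0],
     [0.0, 0.0, 0.0, 3.0, 0.0]]
L = laplacian(W)
ind_a = [1.0, 1.0, 1.0, 0.0, 0.0]   # indicator of the first component
ind_b = [0.0, 0.0, 0.0, 1.0, 1.0]   # indicator of the second component
print(matvec(L, ind_a), matvec(L, ind_b))       # both are the zero vector
print(matvec(L, [1.0, 0.0, 0.0, 0.0, 0.0]))     # not an eigenvector with eigenvalue 0
```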


This suggests the following algorithm. Compute the first $K$ eigenvectors $\mathbf{u}_k$ of $L$. Let
$U = [\mathbf{u}_1, \ldots, \mathbf{u}_K]$ be an $N \times K$ matrix with the eigenvectors in its columns. Let $\mathbf{y}_i \in \mathbb{R}^K$ be
the $i$'th row of $U$. Since these $\mathbf{y}_i$ will be piecewise constant, we can apply K-means clustering
to them to recover the connected components. Now assign point $i$ to cluster $k$ iff row $\mathbf{y}_i$
was assigned to cluster $k$.

In reality, we do not expect a graph derived from a real similarity matrix to have isolated 
connected components — that would be too easy. But it is reasonable to suppose the graph is 
a small “perturbation” from such an ideal. In this case, one can use results from perturbation 
theory to show that the eigenvectors of the perturbed Laplacian will be close to these ideal 
indicator functions (Ng et al. 2001). 

Note that this approach is related to kernel PCA (Section 14.4.4). In particular, KPCA uses the 
largest eigenvectors of $W$; these are equivalent to the smallest eigenvectors of $I - W$. This is
similar to the above method, which computes the smallest eigenvectors of $L = D - W$. See
(Bengio et al. 2004) for details. In practice, spectral clustering gives much better results than 
KPCA. 


Normalized graph Laplacian 


In practice, it is important to normalize the graph Laplacian, to account for the fact that some 
nodes are more highly connected than others. There are two common ways to do this. One
method, used in e.g., (Shi and Malik 2000; Meila 2001), creates a stochastic matrix where each
row sums to one:

$$L_{rw} \triangleq D^{-1} L = I - D^{-1} W \tag{25.46}$$


The eigenvalues and eigenvectors of $L$ and $L_{rw}$ are closely related to each other (see (von
Luxburg 2007) for details). Furthermore, one can show that for $L_{rw}$, the eigenspace of 0 is
again spanned by the indicator vectors $\mathbf{1}_{A_k}$. This suggests the following algorithm: find the
smallest $K$ eigenvectors of $L_{rw}$, create $U$, cluster the rows of $U$ using K-means, then infer the
partitioning of the original points (Shi and Malik 2000). (Note that the eigenvectors/values of
$L_{rw}$ are equivalent to the generalized eigenvectors/values of $L$, which solve $L\mathbf{u} = \lambda D \mathbf{u}$.)

Another method, used in e.g., (Ng et al. 2001), creates a symmetric matrix

$$L_{sym} \triangleq D^{-\frac{1}{2}} L D^{-\frac{1}{2}} = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}} \tag{25.47}$$

This time the eigenspace of 0 is spanned by $D^{\frac{1}{2}} \mathbf{1}_{A_k}$. This suggests the following algorithm: find
the smallest $K$ eigenvectors of $L_{sym}$, create $U$, normalize each row to unit norm by creating
$t_{ij} = u_{ij} / \sqrt{\sum_k u_{ik}^2}$, cluster the rows of $T$ using K-means, then infer the partitioning of the
original points (Ng et al. 2001).

There is an interesting connection between Ncuts and random walks on a graph (Meila
2001). First note that $P = D^{-1} W = I - L_{rw}$ is a stochastic matrix, where $p_{ij} = w_{ij}/d_i$
Figure 25.11 Clustering data consisting of 2 spirals. (a) K-means. (b) Spectral clustering. Figure generated 
by spectralClusteringDemo, written by Wei-Lwun Lu. 


can be interpreted as the probability of going from $i$ to $j$. If the graph is connected and
non-bipartite, it possesses a unique stationary distribution $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_N)$, where $\pi_i =
d_i/\mathrm{vol}(V)$. Furthermore, one can show that

$$\mathrm{Ncut}(A, \bar{A}) = p(\bar{A} \mid A) + p(A \mid \bar{A}) \tag{25.48}$$

This means that we are looking for a cut such that a random walk rarely makes transitions from
$A$ to $\bar{A}$ or vice versa.
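
The random walk quantities above can be computed directly for a small graph. The following sketch (with an arbitrary weight matrix and hypothetical function names) checks that $\pi_i = d_i/\mathrm{vol}(V)$ is stationary for $P = D^{-1}W$, and computes the one-step escape probability $p(\bar{A} \mid A)$ when the walk is started from the stationary distribution, which works out to $W(A, \bar{A})/\mathrm{vol}(A)$.

```python
def transition_matrix(W):
    """P = D^{-1} W: each row of W divided by its weighted degree."""
    return [[w / sum(row) for w in row] for row in W]

def stationary(W):
    """pi_i = d_i / vol(V)."""
    degrees = [sum(row) for row in W]
    vol = sum(degrees)
    return [d / vol for d in degrees]

def escape_prob(W, A):
    """p(next state not in A | current state in A), starting from stationarity."""
    P, pi = transition_matrix(W), stationary(W)
    n = len(W)
    joint = sum(pi[i] * P[i][j] for i in A for j in range(n) if j not in A)
    return joint / sum(pi[i] for i in A)

W = [[0.0, 1.0, 2.0, 0.0],
     [1.0, 0.0, 0.0, 3.0],
     [2.0, 0.0, 0.0, 1.0],
     [0.0, 3.0, 1.0, 0.0]]
P, pi = transition_matrix(W), stationary(W)
pi_next = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
print(pi, pi_next)                  # pi P equals pi
print(escape_prob(W, {0, 1}))       # W(A, A-bar) / vol(A) = 5 / 7
```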


Example 


Figure 25.11 illustrates the method in action. In Figure 25.11(a), we see that K-means does a poor
job of clustering, since it implicitly assumes each cluster corresponds to a spherical Gaussian. 
Next we try spectral clustering. We define a similarity matrix using the Gaussian kernel. We 
compute the first two eigenvectors of the Laplacian. From this we can infer the clustering in 
Figure 25.11(b). 

Since the method is based on finding the smallest K eigenvectors of a sparse matrix, it takes 
$O(N^3)$ time. However, a variety of methods can be used to scale it up for large datasets (see
e.g., (Yan et al. 2009)). 


Hierarchical clustering 


Mixture models, whether finite or infinite, produce a “flat” clustering. Often we want to learn a 
hierarchical clustering, where clusters can be nested inside each other. 

There are two main approaches to hierarchical clustering: bottom-up or agglomerative clus- 
tering, and top-down or divisive clustering. Both methods take as input a dissimilarity matrix 
between the objects. In the bottom-up approach, the most similar groups are merged at each 
Figure 25.12 (a) An example of single link clustering using city block distance. Pairs (1,3) and (4,5) are 
both distance 1 apart, so get merged first. (b) The resulting dendrogram. Based on Figure 7.5 of (Alpaydin 
2004). Figure generated by agglomDemo. 
Figure 25.13 Hierarchical clustering applied to the yeast gene expression data. (a) The rows are permuted 
according to a hierarchical clustering scheme (average link agglomerative clustering), in order to bring 
similar rows close together. (b) 16 clusters induced by cutting the average linkage tree at a certain height. 
Figure generated by hclustYeastDemo. 


step. In the top-down approach, groups are split using various different criteria. We give the 
details below. 

Note that agglomerative and divisive clustering are both just heuristics, which do not optimize 
any well-defined objective function. Thus it is hard to assess the quality of the clustering they 
produce in any formal sense. Furthermore, they will always produce a clustering of the input 
data, even if the data has no structure at all (e.g., it is random noise). Later in this section we 
will discuss a probabilistic version of hierarchical clustering that solves both these problems. 
Algorithm 25.2: Agglomerative clustering 


initialize clusters as singletons: for $i \leftarrow 1$ to $n$ do $C_i \leftarrow \{i\}$;
initialize set of clusters available for merging: $S \leftarrow \{1, \ldots, n\}$;
repeat
    Pick 2 most similar clusters to merge: $(j, k) \leftarrow \arg\min_{j,k \in S} d_{j,k}$;
    Create new cluster $C_\ell \leftarrow C_j \cup C_k$;
    Mark $j$ and $k$ as unavailable: $S \leftarrow S \setminus \{j, k\}$;
    if $C_\ell \neq \{1, \ldots, n\}$ then
        Mark $\ell$ as available, $S \leftarrow S \cup \{\ell\}$;
    foreach $i \in S$ do
        Update dissimilarity matrix $d(i, \ell)$;
until no more clusters are available for merging;


Figure 25.14 Illustration of (a) Single linkage. (b) Complete linkage. (c) Average linkage. 


Agglomerative clustering 


Agglomerative clustering starts with $N$ groups, each initially containing one object, and then at
each step it merges the two most similar groups until there is a single group, containing all the
data. See Algorithm 25.2 for the pseudocode. Since picking the two most similar clusters to merge
takes $O(N^2)$ time, and there are $O(N)$ steps in the algorithm, the total running time is $O(N^3)$.
However, by using a priority queue, this can be reduced to $O(N^2 \log N)$ (see e.g., (Manning
et al. 2008, ch. 17) for details). For large $N$, a common heuristic is to first run K-means, which
takes $O(KND)$ time, and then apply hierarchical clustering to the estimated cluster centers.

The merging process can be represented by a binary tree, called a dendrogram, as shown 
in Figure 25.12(b). The initial groups (objects) are at the leaves (at the bottom of the figure), 
and every time two groups are merged, we join them in the tree. The height of the branches 
represents the dissimilarity between the groups that are being joined. The root of the tree (which 
is at the top) represents a group containing all the data. If we cut the tree at any given height, 
we induce a clustering of a given size. For example, if we cut the tree in Figure 25.12(b) at 
height 2, we get the clustering {{{4,5}, {1,3}}, {2}}. We discuss the issue of how to choose 
the height/number of clusters below. 

A more complex example is shown in Figure 25.13(a), where we show some gene expression 
data. If we cut the tree in Figure 25.13(a) at a certain height, we get the 16 clusters shown in 
Figure 25.13(b). 

There are actually three variants of agglomerative clustering, depending on how we define 
the dissimilarity between groups of objects. These can give quite different results, as shown in 
Figure 25.15 Hierarchical clustering of yeast gene expression data. (a) Single linkage. (b) Complete linkage. 
(c) Average linkage. Figure generated by hclustYeastDemo. 
Figure 25.15. We give the details below. 


Single link 


In single link clustering, also called nearest neighbor clustering, the distance between two
groups $G$ and $H$ is defined as the distance between the two closest members of each group:

$$d_{SL}(G, H) = \min_{i \in G, i' \in H} d_{i,i'} \tag{25.49}$$

See Figure 25.14(a).

The tree built using single link clustering is a minimum spanning tree of the data, which 
is a tree that connects all the objects in a way that minimizes the sum of the edge weights 
(distances). To see this, note that when we merge two clusters, we connect together the two 
closest members of the clusters; this adds an edge between the corresponding nodes, and this 
is guaranteed to be the “lightest weight” edge joining these two clusters. And once two clusters 
have been merged, they will never be considered again, so we cannot create cycles. As a 
consequence of this, we can actually implement single link clustering in $O(N^2)$ time, whereas
the other variants take $O(N^3)$ time.


Complete link 


In complete link clustering, also called furthest neighbor clustering, the distance between
two groups is defined as the distance between the two most distant members of each group:

$$d_{CL}(G, H) = \max_{i \in G, i' \in H} d_{i,i'} \tag{25.50}$$

See Figure 25.14(b).

Single linkage only requires that a single pair of objects be close for the two groups to 
be considered close together, regardless of the similarity of the other members of the group. 
Thus clusters can be formed that violate the compactness property, which says that all the 
observations within a group should be similar to each other. In particular if we define the
diameter of a group as the largest dissimilarity of its members, $d_G = \max_{i \in G, i' \in G} d_{i,i'}$, then
we can see that single linkage can produce clusters with large diameters. Complete linkage
represents the opposite extreme: two groups are considered close only if all of the observations 
in their union are relatively similar. This will tend to produce clusterings with small diameter, 
i.e., compact clusters. 


Average link 


In practice, the preferred method is average link clustering, which measures the average 
distance between all pairs: 


$$d_{avg}(G, H) = \frac{1}{n_G n_H} \sum_{i \in G} \sum_{i' \in H} d_{i,i'} \tag{25.51}$$

where $n_G$ and $n_H$ are the number of elements in groups $G$ and $H$. See Figure 25.14(c).
Average link clustering represents a compromise between single and complete link clustering.
It tends to produce relatively compact clusters that are relatively far apart. However, since it
involves averaging of the $d_{i,i'}$'s, any change to the measurement scale can change the result. In
contrast, single linkage and complete linkage are invariant to monotonic transformations of $d_{i,i'}$,
since they leave the relative ordering the same.
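
The three linkage criteria (Equations 25.49 to 25.51) can be compared with a naive implementation of agglomerative clustering. This is a sketch for illustration, not the book's agglomDemo: it recomputes the group dissimilarities from scratch at every step (so it runs in $O(N^3)$ time overall), and the function name and toy data are assumptions.

```python
def agglomerative(dist, linkage="average"):
    """Naive agglomerative clustering on a symmetric dissimilarity matrix.

    Returns the merges in order, each as (members_a, members_b, height).
    """
    def group_distance(a, b):
        pairs = [dist[i][j] for i in a for j in b]
        if linkage == "single":
            return min(pairs)              # closest pair of members
        if linkage == "complete":
            return max(pairs)              # most distant pair of members
        return sum(pairs) / len(pairs)     # average over all pairs

    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Pick the two most similar clusters (ties go to the first pair found).
        height, a, b = min(
            ((group_distance(clusters[p], clusters[q]), clusters[p], clusters[q])
             for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda t: t[0])
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges

points = [0.0, 1.0, 3.0, 7.0, 8.0]
dist = [[abs(a - b) for b in points] for a in points]
for linkage in ("single", "complete", "average"):
    heights = [h for _, _, h in agglomerative(dist, linkage)]
    print(linkage, heights)
```

The merge heights are what a dendrogram would plot on its vertical axis. On this toy data the three criteria agree on the tree but not on the heights of the later merges.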


Divisive clustering 


Divisive clustering starts with all the data in a single cluster, and then recursively divides each 
cluster into two daughter clusters, in a top-down fashion. Since there are $2^{N-1} - 1$ ways to split
a group of $N$ items into 2 groups, it is hard to compute the optimal split, so various heuristics
are used. One approach is to pick the cluster with the largest diameter, and split it in two using the
K-means or K-medoids algorithm with K = 2. This is called the bisecting K-means algorithm 
(Steinbach et al. 2000). We can repeat this until we have any desired number of clusters. This 
can be used as an alternative to regular K-means, but it also induces a hierarchical clustering. 

Another method is to build a minimum spanning tree from the dissimilarity graph, and then 
to make new clusters by breaking the link corresponding to the largest dissimilarity. (This 
actually gives the same results as single link agglomerative clustering.) 

Yet another method, called dissimilarity analysis (Macnaughton-Smith et al. 1964), is as 
follows. We start with a single cluster containing all the data, G = {1,...,N}. We then 
measure the average dissimilarity of $i \in G$ to all the other $i' \in G$:

$$d_i^G = \frac{1}{n_G} \sum_{i' \in G} d_{i,i'} \tag{25.52}$$

We remove the most dissimilar object and put it in its own cluster $H$:

$$i^* = \arg\max_{i \in G} d_i^G, \quad G = G \setminus \{i^*\}, \quad H = \{i^*\} \tag{25.53}$$

We now continue to move objects from $G$ to $H$ until some stopping criterion is met. Specifically,
we pick a point $i^*$ to move that maximizes the average dissimilarity to each $i' \in G$ but minimizes
the average dissimilarity to each $i' \in H$:

$$d_i^H = \frac{1}{n_H} \sum_{i' \in H} d_{i,i'}, \quad i^* = \arg\max_{i \in G} d_i^G - d_i^H \tag{25.54}$$

We continue to do this until $d_i^G - d_i^H$ is negative. The final result is that we have split $G$ into
two daughter clusters, G and H. We can then recursively call the algorithm on G and/or H, or 
on any other node in the tree. For example, we might choose to split the node G whose average 
dissimilarity is highest, or whose maximum dissimilarity (i.e., diameter) is highest. We continue 
the process until the average dissimilarity within each cluster is below some threshold, and/or 
all clusters are singletons. 
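
A single split of the dissimilarity analysis procedure (Equations 25.52 to 25.54) can be sketched as follows. The function name and the toy data are illustrative. The guard that keeps at least one object in $G$ is an added assumption, not part of the description above.

```python
def split_group(dist, group):
    """Split `group` (a list of indices) into two daughter clusters (G, H)."""
    def avg(i, members):
        # Average dissimilarity of i to `members`, as in Equations 25.52 and 25.54
        # (when i is itself a member, its zero self-dissimilarity is included).
        return sum(dist[i][j] for j in members) / len(members)

    G, H = list(group), []
    # Seed H with the object that is most dissimilar to the rest of G.
    seed = max(G, key=lambda i: avg(i, G))
    G.remove(seed)
    H.append(seed)
    # Keep moving objects from G to H while that makes them better placed.
    while len(G) > 1:   # assumption: never empty G completely
        best = max(G, key=lambda i: avg(i, G) - avg(i, H))
        if avg(best, G) - avg(best, H) < 0:
            break
        G.remove(best)
        H.append(best)
    return G, H

points = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(a - b) for b in points] for a in points]
G, H = split_group(dist, list(range(len(points))))
print(sorted(G), sorted(H))
```

Calling `split_group` recursively on the returned groups would build the full top-down hierarchy.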

Divisive clustering is less popular than agglomerative clustering, but it has two advantages. 
First, it can be faster, since if we only split for a constant number of levels, it takes just O(N) 
time. Second, the splitting decisions are made in the context of seeing all the data, whereas 
bottom-up methods make myopic merge decisions. 
Choosing the number of clusters 


It is difficult to choose the “right” number of clusters, since a hierarchical clustering algorithm
will always create a hierarchy, even if the data is completely random. But, as with choosing $K$
for K-means, there is the hope that there will be a visible “gap” in the lengths of the links in the
dendrogram (which represent the dissimilarity between merged groups) between natural clusters and
unnatural clusters. Of course, on real data, this gap might be hard to detect. In Section 25.5.4,
we will present a Bayesian approach to hierarchical clustering that nicely solves this problem.


Bayesian hierarchical clustering 


There are several ways to make probabilistic models which produce results similar to hierarchical 
clustering, e.g., (Williams 2000; Neal 2003b; Castro et al. 2004; Lau and Green 2006). Here we 
present one particular approach called Bayesian hierarchical clustering (Heller and Ghahramani
2005). Algorithmically it is very similar to standard bottom-up agglomerative clustering,
and takes comparable time, whereas several of the other techniques referenced above are much 
slower. However, it uses Bayesian hypothesis tests to decide which clusters to merge (if any), 
rather than computing the similarity between groups of points in some ad-hoc way. These 
hypothesis tests are closely related to the calculations required to do inference in a Dirichlet 
process mixture model, as we will see. Furthermore, the input to the model is a data matrix, 
not a dissimilarity matrix. 


The algorithm 


Let $\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ represent all the data, and let $\mathcal{D}_i$ be the set of datapoints at the leaves
of the subtree $T_i$. At each step, we compare two trees $T_i$ and $T_j$ to see if they should be
merged into a new tree. Define $\mathcal{D}_{ij}$ as their merged data, and let $M_{ij} = 1$ if they should be
merged, and $M_{ij} = 0$ otherwise.

The probability of a merge is given by

$$r_{ij} \triangleq \frac{p(\mathcal{D}_{ij} \mid M_{ij} = 1)\, p(M_{ij} = 1)}{p(\mathcal{D}_{ij} \mid T_{ij})} \tag{25.55}$$

$$p(\mathcal{D}_{ij} \mid T_{ij}) = p(\mathcal{D}_{ij} \mid M_{ij} = 1)\, p(M_{ij} = 1) + p(\mathcal{D}_{ij} \mid M_{ij} = 0)\, p(M_{ij} = 0) \tag{25.56}$$

Here $p(M_{ij} = 1)$ is the prior probability of a merge, which can be computed using a bottom-up
algorithm described below. We now turn to the likelihood terms. If $M_{ij} = 1$, the data in $\mathcal{D}_{ij}$ is
assumed to come from the same model, and hence

$$p(\mathcal{D}_{ij} \mid M_{ij} = 1) = \int \left[ \prod_{\mathbf{x}_n \in \mathcal{D}_{ij}} p(\mathbf{x}_n \mid \boldsymbol{\theta}) \right] p(\boldsymbol{\theta} \mid \boldsymbol{\lambda})\, d\boldsymbol{\theta} \tag{25.57}$$

If $M_{ij} = 0$, the data in $\mathcal{D}_{ij}$ is assumed to have been generated by each tree independently, so

$$p(\mathcal{D}_{ij} \mid M_{ij} = 0) = p(\mathcal{D}_i \mid T_i)\, p(\mathcal{D}_j \mid T_j) \tag{25.58}$$

These two terms will have already been computed by the bottom-up process. Consequently
we have all the quantities we need to decide which trees to merge. See Algorithm 25.3 for the
pseudocode, assuming $p(M_{ij})$ is uniform. When finished, we can cut the tree at points where
$r_{ij} < 0.5$.
Algorithm 25.3: Bayesian hierarchical clustering 

Initialize $\mathcal{D}_i = \{\mathbf{x}_i\}$, $i = 1 : N$;
Compute $p(\mathcal{D}_i \mid T_i)$, $i = 1 : N$;
repeat
    for each pair of clusters $i, j$ do
        Compute $p(\mathcal{D}_{ij} \mid T_{ij})$;
    Find the pair $\mathcal{D}_i$ and $\mathcal{D}_j$ with highest merge probability $r_{ij}$;
    Merge $\mathcal{D}_k := \mathcal{D}_i \cup \mathcal{D}_j$;
    Delete $\mathcal{D}_i$, $\mathcal{D}_j$;
until all clusters merged;


The connection with Dirichlet process mixture models 


In this section, we will establish the connection between BHC and DPMMs. This will in turn 
give us an algorithm to compute the prior probabilities p(M;; = 1). 
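Note that the marginal likelihood of a DPMM, summing over all $2^N - 1$ partitions, is given by

$$p(\mathcal{D}_k) = \sum_{v \in \mathcal{V}} p(v)\, p(\mathcal{D}_v) \tag{25.59}$$

$$p(v) = \frac{\alpha^{m_v} \prod_{l=1}^{m_v} \Gamma(n_l^v)}{\Gamma(n_k + \alpha) / \Gamma(\alpha)} \tag{25.60}$$

$$p(\mathcal{D}_v) = \prod_{l=1}^{m_v} p(\mathcal{D}_l^v) \tag{25.61}$$

where $\mathcal{V}$ is the set of all possible partitions of $\mathcal{D}_k$, $p(v)$ is the probability of partition $v$, $m_v$ is
the number of clusters in partition $v$, $n_l^v$ is the number of points in cluster $l$ of partition $v$, $\mathcal{D}_l^v$
are the points in cluster $l$ of partition $v$, and $n_k$ is the number of points in $\mathcal{D}_k$.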

One can show (Heller and Ghahramani 2005) that $p(\mathcal{D}_k \mid T_k)$ computed by the BHC algorithm
is similar to $p(\mathcal{D}_k)$ given above, except for the fact that it only sums over partitions which are
consistent with tree Tk. (The number of tree-consistent partitions is exponential in the number 
of data points for balanced binary trees, but this is obviously a subset of all possible partitions.) 
In this way, we can use the BHC algorithm to compute a lower bound on the marginal likelihood 
of the data from a DPMM. Furthermore, we can interpret the algorithm as greedily searching 
through the exponentially large space of tree-consistent partitions to find the best ones of a 
given size at each step. 

We are now in a position to compute $\pi_k \triangleq p(M_k = 1)$, for each node $k$ with children $i$ and
$j$. This is equal to the probability of cluster $\mathcal{D}_k$ coming from the DPMM, relative to all other
partitions of $\mathcal{D}_k$ consistent with the current tree. This can be computed as follows: initialize
$d_i = \alpha$ and $\pi_i = 1$ for each leaf $i$; then as we build the tree, for each internal node $k$, compute
$d_k = \alpha \Gamma(n_k) + d_i d_j$, and $\pi_k = \frac{\alpha \Gamma(n_k)}{d_k}$, where $i$ and $j$ are $k$'s left and right children.
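
The merge computation can be sketched for binary data. The code below is an illustration under stated assumptions, not the authors' implementation: it uses a beta-Bernoulli model with independent features and a $\mathrm{Beta}(a, b)$ prior for the marginal likelihood in Equation 25.57, works in log space, and uses the recursion for $d_k$ and $\pi_k$ given above. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass

def log_add(a, b):
    """log(exp(a) + exp(b)), computed stably."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_marginal(data, a=1.0, b=1.0):
    """log p(D | M = 1) for binary rows under independent beta-Bernoulli features."""
    n, total = len(data), 0.0
    for d in range(len(data[0])):
        ones = sum(row[d] for row in data)
        total += (math.lgamma(a + ones) + math.lgamma(b + n - ones) - math.lgamma(a + b + n)
                  - (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)))
    return total

@dataclass
class Node:
    data: list
    log_d: float        # log d_k
    log_pi: float       # log p(M_k = 1)
    log_p_tree: float   # log p(D_k | T_k)
    r: float = 1.0      # merge probability r_k

def make_leaf(x, alpha):
    # Leaves have d_i = alpha and pi_i = 1.
    return Node([x], math.log(alpha), 0.0, log_marginal([x]))

def merge(left, right, alpha):
    data = left.data + right.data
    log_ag = math.log(alpha) + math.lgamma(len(data))        # log(alpha * Gamma(n_k))
    log_d = log_add(log_ag, left.log_d + right.log_d)        # d_k = alpha*Gamma(n_k) + d_i*d_j
    log_pi = log_ag - log_d
    log_h1 = log_pi + log_marginal(data)                     # merged hypothesis
    log_h0 = (left.log_d + right.log_d - log_d) + left.log_p_tree + right.log_p_tree
    log_p_tree = log_add(log_h1, log_h0)                     # Equation 25.56
    return Node(data, log_d, log_pi, log_p_tree, math.exp(log_h1 - log_p_tree))

alpha = 1.0
a_leaf, b_leaf = make_leaf([1, 1, 1, 1], alpha), make_leaf([1, 1, 1, 1], alpha)
c_leaf = make_leaf([0, 0, 0, 0], alpha)
same, diff = merge(a_leaf, b_leaf, alpha), merge(a_leaf, c_leaf, alpha)
print("r for identical points:", round(same.r, 3))
print("r for opposite points: ", round(diff.r, 3))
```

The term $\log(1 - \pi_k)$ is computed as $\log d_i + \log d_j - \log d_k$, which follows from the recursion and avoids subtracting nearly equal numbers. A full BHC run would call `merge` on every pair of current trees and keep the pair with the highest `r`.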
Data Set      Single Linkage    Complete Linkage    Average Linkage    BHC

Synthetic     0.599 ± 0.033     0.634 ± 0.024       0.668 ± 0.040      0.828 ± 0.025
Newsgroups    0.275 ± 0.001     0.315 ± 0.008       0.282 ± 0.002      0.465 ± 0.016
Spambase      0.598 ± 0.017     0.699 ± 0.017       0.668 ± 0.019      0.728 ± 0.029
Digits        0.224 ± 0.004     0.299 ± 0.006       0.342 ± 0.005      0.393 ± 0.015
Fglass        0.478 ± 0.009     0.476 ± 0.009       0.491 ± 0.009      0.467 ± 0.011

Table 25.1 Purity scores for various hierarchical clustering schemes applied to various data sets. The
synthetic data has $N = 200$, $D = 2$, $C = 4$ and real features. Newsgroups is extracted from the 20
newsgroups dataset ($D = 500$, $N = 800$, $C = 4$, binary features). Spambase has $N = 100$, $C = 2$,
$D = 57$, binary features. Digits is the CEDAR Buffalo digits ($N = 200$, $C = 10$, $D = 64$, binarized features).
Fglass is the forensic glass dataset ($N = 214$, $C = 6$, $D = 9$, real features). Source: Table 1 of (Heller and
Ghahramani 2005). Used with kind permission of Katherine Heller.


Learning the hyper-parameters 


The model has two free parameters: $\alpha$ and $\boldsymbol{\lambda}$, where $\boldsymbol{\lambda}$ are the hyper-parameters for the prior
on the parameters $\boldsymbol{\theta}$. In (Heller and Ghahramani 2005), they show how one can back-propagate
gradients of the form $\frac{\partial p(\mathcal{D}_k \mid T_k)}{\partial \boldsymbol{\lambda}}$ through the tree, and thus perform an empirical Bayes estimate
of the hyper-parameters.


Experimental results 


(Heller and Ghahramani 2005) compared BHC with traditional agglomerative clustering algo- 
rithms on various data sets in terms of purity scores. The results are shown in Table 25.1. We 
see that BHC did much better than the other methods on all datasets except the forensic glass 
one. 

Figure 25.16 visualizes the tree structure estimated by BHC and agglomerative hierarchical 
clustering (AHC) on the newsgroup data (using a beta-Bernoulli model). The BHC tree is clearly 
superior (look at the colors at the leaves, which represent class labels). Figure 25.17 is a zoom-in 
on the top few nodes of these two trees. BHC splits off clusters concerning sports from clusters 
concerning cars and space. AHC keeps sports and cars merged together. Although sports and 
cars both fall under the same “rec” newsgroup heading (as opposed to space, that comes under 
the “sci” newsgroup heading), the BHC clustering still seems more reasonable, and this is borne 
out by the quantitative purity scores. 

BHC has also been applied to gene expression data, with good results (Savage et al. 2009). 


Clustering datapoints and features 


So far, we have been concentrating on clustering datapoints. But each datapoint is often 
described by multiple features, and we might be interested in clustering them as well. Below we 
describe some methods for doing this.

Figure 25.16 Hierarchical clustering applied to 800 documents from 4 newsgroups (red is rec.autos, blue
is rec.sport.baseball, green is rec.sport.hockey, and magenta is sci.space). Top: average linkage hierarchical
clustering. Bottom: Bayesian hierarchical clustering. Each of the leaves is labeled with a color, according
to which newsgroup that document came from. We see that the Bayesian method results in a clustering
that is more consistent with these labels (which were not used during model fitting). Source: Figure 7 of
(Heller and Ghahramani 2005). Used with kind permission of Katherine Heller.

Figure 25.17 Zoom-in on the top nodes in the trees of Figure 25.16. (a) Bayesian method. (b) Average
linkage. We show the 3 most probable words per cluster. The number of documents at each cluster is also 
given. Source: Figure 5 of (Heller and Ghahramani 2005). Used with kind permission of Katherine Heller. 


Biclustering 


Clustering the rows and columns is known as biclustering or coclustering. This is widely used 
in bioinformatics, where the rows often represent genes and the columns represent conditions. 
It can also be used for collaborative filtering, where the rows represent users and the columns 
represent movies. 

A variety of ad hoc methods for biclustering have been proposed; see (Madeira and Oliveira 
2004) for a review. Here we present a simple probabilistic generative model, based on (Kemp 
et al. 2006) (see also (Sheng et al. 2003) for a related approach). The idea is to associate each
row and each column with a latent indicator, $r_i \in \{1,\ldots,K^r\}$, $c_j \in \{1,\ldots,K^c\}$. We then
assume the data are iid across samples and across features within each block:


$$p(\mathbf{x}|\mathbf{r},\mathbf{c},\boldsymbol{\theta}) = \prod_i \prod_j p(x_{ij}|r_i,c_j,\boldsymbol{\theta}) = \prod_i \prod_j p(x_{ij}|\boldsymbol{\theta}_{r_i,c_j}) \tag{25.62}$$


where $\boldsymbol{\theta}_{a,b}$ are the parameters for row cluster $a$ and column cluster $b$. Rather than using a finite
number of clusters for the rows and columns, we can use a Dirichlet process, as in the infinite
relational model which we discuss in Section 27.6.1. We can fit this model using e.g., (collapsed)
Gibbs sampling.

The behavior of this model is illustrated in Figure 25.18. The data has the form $X(i,j) = 1$
iff animal $i$ has feature $j$, where $i = 1:50$ and $j = 1:85$. The animals represent whales, bears,
horses, etc. The features represent properties of the habitat (jungle, tree, coastal), or anatomical 
properties (has teeth, quadrapedal), or behavioral properties (swims, eats meat), etc. The model, 
using a Bernoulli likelihood, was fit to the data. It discovered 12 animal clusters and 33 feature 
clusters. For example, it discovered a bicluster that represents the fact that mammals tend to 
have aquatic features. 


Multi-view clustering 


The problem with biclustering is that each object (row) can only belong to one cluster. Intuitively, 
an object can have multiple roles, and can be assigned to different clusters depending on which

O1 killer whale, blue whale, humpback, seal, walrus, dolphin
O2 antelope, horse, giraffe, zebra, deer
O3 monkey, gorilla, chimp
O4 hippo, elephant, rhino
O5 grizzly bear, polar bear

F1 flippers, strain teeth, swims, arctic, coastal, ocean, water
F2 hooves, long neck, horns
F3 hands, bipedal, jungle, tree
F4 bulbous body shape, slow, inactive
F5 meat teeth, eats meat, hunter, fierce
F6 walks, quadrapedal, ground


Figure 25.18 Illustration of biclustering. We show 5 of the 12 animal clusters, and 6 of the 33 feature
clusters. The original data matrix is shown, partitioned according to the discovered clusters. From Figure 
3 of (Kemp et al. 2006). Used with kind permission of Charles Kemp. 




Figure 25.19 (a) Illustration of multi-view clustering. Here we have 3 views (column partitions). In the 
first view, we have 2 clusters (row partitions). In the second view, we have 3 clusters. In the third view, 
we have 2 clusters. The number of views and partitions are inferred from data. Rows within each colored
block are assumed to be generated iid; however, each column can have a different distributional form, which
is useful for modeling discrete and continuous data. From Figure 1 of (Guan et al. 2010). Used with kind 
permission of Jennifer Dy. (b) Corresponding DGM. 


subset of features you use. For example, in the animal dataset, we may want to group the 
animals on the basis of anatomical features (e.g., mammals are warm blooded, reptiles are not), 
or on the basis of behavioral features (e.g., predators vs prey). 

We now present a model that can capture this phenomenon. This model was indepen- 
dently proposed in (Shafto et al. 2006; Mansinghka et al. 2011), who call it crosscat (for cross- 
categorization), and in (Guan et al. 2010; Cui et al. 2010), who call it (non-parametric) multi-clust. 
(See also (Rodriguez and Ghosh 2011) for a very similar model.) The idea is that we partition the 
columns (features) into $V$ groups or views, so $c_j \in \{1,\ldots,V\}$, where $j \in \{1,\ldots,D\}$ indexes
features. We will use a Dirichlet process prior for $p(\mathbf{c})$, which allows $V$ to grow automatically.
Then for each partition of the columns (i.e., each view), call it $v$, we partition the rows, again
using a DP, as illustrated in Figure 25.19(a). Let $r_{iv} \in \{1,\ldots,K(v)\}$ be the cluster to which
the $i$th row belongs in view $v$. Finally, having partitioned the rows and columns, we generate
the data: we assume all the rows and columns within a block are iid. We can define the model
more precisely as follows:


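$$p(\mathbf{c},\mathbf{r},\mathcal{D}) = p(\mathbf{c})\, p(\mathbf{r}|\mathbf{c})\, p(\mathcal{D}|\mathbf{r},\mathbf{c}) \tag{25.63}$$
$$p(\mathbf{c}) = \mathrm{DP}(\mathbf{c}|\alpha) \tag{25.64}$$
$$p(\mathbf{r}|\mathbf{c}) = \prod_{v=1}^{V(\mathbf{c})} \mathrm{DP}(\mathbf{r}_v|\beta) \tag{25.65}$$
$$p(\mathcal{D}|\mathbf{r},\mathbf{c},\boldsymbol{\theta}) = \prod_{v=1}^{V(\mathbf{c})} \prod_{j: c_j = v} \left[ \prod_{k=1}^{K(\mathbf{r}_v)} \int \prod_{i: r_{iv} = k} p(x_{ij}|\boldsymbol{\theta}_{jk})\, p(\boldsymbol{\theta}_{jk})\, d\boldsymbol{\theta}_{jk} \right] \tag{25.66}$$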


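See Figure 25.19(b) for the DGM (see footnote 2).
If the data is binary, and we use a $\mathrm{Beta}(\gamma, \gamma)$ prior for $\theta_{jk}$, the likelihood reduces to


$$p(\mathcal{D}|\mathbf{r},\mathbf{c},\gamma) = \prod_{v=1}^{V(\mathbf{c})} \prod_{j: c_j = v} \prod_{k=1}^{K(\mathbf{r}_v)} \frac{\mathrm{Beta}(n_{j,k,v} + \gamma,\ \bar{n}_{j,k,v} + \gamma)}{\mathrm{Beta}(\gamma, \gamma)} \tag{25.67}$$


where $n_{j,k,v} = \sum_{i: r_{i,v} = k} \mathbb{I}(x_{ij} = 1)$ counts the number of features which are on in the $j$'th
column for view $v$ and for row cluster $k$. Similarly, $\bar{n}_{j,k,v}$ counts how many features are off.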
The model is easily extended to other kinds of data, by replacing the beta-Bernoulli with, say, 
the Gaussian-Gamma-Gaussian model, as discussed in (Guan et al. 2010; Mansinghka et al. 2011). 
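To make Equation 25.67 concrete, here is a minimal sketch that evaluates the beta-Bernoulli marginal likelihood for a given column partition and a given row partition per view. It reads $\mathrm{Beta}(\cdot,\cdot)$ in Equation 25.67 as the beta function, covers only the data term (not the DP priors on the partitions), and uses hypothetical names (`crosscat_log_marglik`, `col_view`, `row_cluster`) chosen for this illustration.

```python
import math

def log_beta_fn(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def crosscat_log_marglik(X, col_view, row_cluster, gamma=1.0):
    """Log of Equation 25.67 for binary data.

    X            : list of rows, each a list of 0/1 values
    col_view[j]  : view index v of column j
    row_cluster[v][i] : cluster index of row i within view v
    """
    total = 0.0
    for j, v in enumerate(col_view):
        on, off = {}, {}
        for i, row in enumerate(X):
            k = row_cluster[v][i]
            if row[j] == 1:
                on[k] = on.get(k, 0) + 1
            else:
                off[k] = off.get(k, 0) + 1
        # Empty clusters contribute a factor of 1, so they can be skipped.
        for k in set(on) | set(off):
            total += (log_beta_fn(on.get(k, 0) + gamma, off.get(k, 0) + gamma)
                      - log_beta_fn(gamma, gamma))
    return total

X = [[1], [1]]
one_cluster = crosscat_log_marglik(X, col_view=[0], row_cluster=[[0, 0]])
two_clusters = crosscat_log_marglik(X, col_view=[0], row_cluster=[[0, 1]])
print(math.exp(one_cluster), math.exp(two_clusters))
```

With two identical rows and $\gamma = 1$, grouping the rows together gives probability 1/3, whereas splitting them gives 1/4, so the likelihood term alone already favors the shared cluster.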

Approximate MAP estimation can be done using stochastic search (Shafto et al. 2006), and 
approximate inference can be done using variational Bayes (Guan et al. 2010) or Gibbs sampling 
(Mansinghka et al. 2011). The hyper-parameter $\gamma$ for the likelihood can usually be set in a
non-informative way, but results are more sensitive to the other two parameters, since $\alpha$ controls
the number of column partitions, and $\beta$ controls the number of row partitions. Hence a more
robust technique is to infer the hyper-parameters using MH. This also speeds up convergence
(Mansinghka et al. 2011). 

Figure 25.20 illustrates the model applied to some binary data containing 22 animals and 106 
features. The figure shows the (approximate) MAP partition. The first partition of the columns
contains taxonomic features, such as “has bones”, “is warm-blooded”, “lays eggs”, etc. This
divides the animals into birds, reptiles/amphibians, mammals, and invertebrates. The second
partition of the columns contains features that are treated as noise, with no apparent structure 
(except for the single row labeled “frog”). The third partition of the columns contains ecological 
features like “dangerous”, “carnivorous”, “lives in water”, etc. This divides the animals into prey, 
land predators, sea predators and air predators. Thus each animal (row) can belong to a different 


2. The dependence between $\mathbf{r}$ and $\mathbf{c}$ is not shown, since it is not a dependence between the values of $r_{iv}$ and $c_j$, but
between the cardinality of $v$ and $c_j$. In other words, the number of row partitions we need to specify (the number of
views, indexed by $v$) depends on the number of column partitions (clusters) that we have.

Figure 25.20 MAP estimate produced by the crosscat system when applied to a binary data matrix of
animals (rows) by features (columns). See text for details. Source: Figure 7 of (Shafto et al. 2006). Used
with kind permission of Vikash Mansinghka.


cluster depending on what set of features are considered. Uncertainty about the partitions can 
be handled by sampling. 

It is interesting to compare this model to a standard infinite mixture model. While the
standard model can represent any density on fixed-sized vectors as $N \to \infty$, it cannot cope
with $D \to \infty$, since it has no way to handle irrelevant, noisy or redundant features. By contrast,
the crosscat/multi-clust system is robust to irrelevant features: it can just partition them off,
and cluster the rows only using the relevant features. Note, however, that it does not need a
separate “background” model, since everything is modelled using the same mechanism. This is
useful, since one person's noise is another person's signal. (Indeed, this symmetry may explain
why multi-clust outperformed the sparse mixture model approach of (Law et al. 2004) in the 
experiments reported in (Guan et al. 2010).) 


Graphical model structure learning 


26.1 Introduction 


We have seen how graphical models can be used to express conditional independence assump- 
tions between variables. In this chapter, we discuss how to learn the structure of the graphical 
model itself. That is, we want to compute p(G|D), where G is the graph structure, represented 
as a $V \times V$ adjacency matrix.

As we discussed in Section 1.3.3, there are two main applications of structure learning: knowl- 
edge discovery and density estimation. The former just requires a graph topology, whereas the 
latter requires a fully specified model. 

The main obstacle in structure learning is that the number of possible graphs is exponential in 
the number of nodes: a simple upper bound is $O(2^{V(V-1)/2})$. Thus the full posterior $p(G|\mathcal{D})$
is prohibitively large: even if we could afford to compute it, we could not even store it. So we 
will seek appropriate summaries of the posterior. These summary statistics depend on our task. 
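To get a feel for how quickly this bound grows, the following sketch evaluates $2^{V(V-1)/2}$, which is the exact number of undirected graphs on $V$ labeled nodes, for a few small values of $V$. The function name is hypothetical and the snippet is only an arithmetic illustration.

```python
def num_undirected_graphs(V):
    """Each of the V*(V-1)/2 node pairs either has an edge or does not."""
    return 2 ** (V * (V - 1) // 2)

for V in (5, 10, 20):
    print(V, num_undirected_graphs(V))
```

Already at $V = 20$ there are $2^{190}$ candidate graphs, which is why the posterior can only be summarized rather than enumerated.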

If our goal is knowledge discovery, we may want to compute posterior edge marginals, 
$p(G_{st} = 1|\mathcal{D})$; we can then plot the corresponding graph, where the thickness of each edge
represents our confidence in its presence. By setting a threshold, we can generate a sparse 
graph, which can be useful for visualization purposes (see Figure 1.11). 

If our goal is density estimation, we may want to compute the MAP graph, $\hat{G} \in \arg\max_G p(G|\mathcal{D})$.
In most cases, finding the globally optimal graph will take exponential time, so we will use dis- 
crete optimization methods such as heuristic search. However, in the case of trees, we can 
find the globally optimal graph structure quite efficiently using exact methods, as we discuss in 
Section 26.3. 

If density estimation is our only goal, it is worth considering whether it would be more 
appropriate to learn a latent variable model, which can capture correlation between the visible 
variables via a set of latent common causes (see Chapters 12 and 27). Such models are often 
easier to learn and, perhaps more importantly, they can be applied (for prediction purposes) 
much more efficiently, since they do not require performing inference in a learned graph with 
potentially high treewidth. The downside with such models is that the latent factors are often 
unidentifiable, and hence hard to interpret. Of course, we can combine graphical model structure 
learning and latent variable learning, as we will show later in this chapter. 

In some cases, we don't just want to model the observed correlation between variables; 
instead, we want to model the causal structure behind the data, so we can predict the effects 
of manipulating variables. This is a much more challenging task, which we briefly discuss in
Section 26.6.


Figure 26.1 Part of a relevance network constructed from the 20-news data shown in Figure 1.2. We
show edges whose mutual information is greater than or equal to 20% of the maximum pairwise MI. For
clarity, the graph has been cropped, so we only show a subset of the nodes and edges. Figure generated
by relevanceNetworkNewsgroupDemo.


Structure learning for knowledge discovery 


Since computing the MAP graph or the exact posterior edge marginals is in general computa- 
tionally intractable (Chickering 1996), in this section we discuss some “quick and dirty” methods 
for learning graph structures which can be used to visualize one’s data. The resulting models do 
not constitute consistent joint probability distributions, so they cannot be used for prediction, 
and they cannot even be formally evaluated in terms of goodness of fit. Nevertheless, these 
methods are a useful ad hoc tool to have in one’s data visualization toolbox, in view of their 
speed and simplicity. 


Relevance networks 


A relevance network is a way of visualizing the pairwise mutual information between multiple 
random variables: we simply choose a threshold and draw an edge from node $i$ to node $j$ if
$I(X_i; X_j)$ is above this threshold. In the Gaussian case, $I(X_i; X_j) = -\frac{1}{2} \log(1 - \rho_{ij}^2)$, where
$\rho_{ij}$ is the correlation coefficient (see Exercise 2.13), so we are essentially visualizing $\boldsymbol{\Sigma}$; this is
known as the covariance graph (Section 19.4.4.1).
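For discrete data, the construction can be sketched as follows. This is a minimal illustration using plug-in (empirical) mutual information; the function names and the `frac` parameter (the threshold as a fraction of the largest pairwise MI) are assumptions of this sketch, not the book's relevanceNetworkNewsgroupDemo code.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical MI (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def relevance_network(columns, frac=0.2):
    """Keep edges whose MI is at least `frac` times the maximum pairwise MI."""
    mi = {(i, j): mutual_information(columns[i], columns[j])
          for i, j in combinations(range(len(columns)), 2)}
    threshold = frac * max(mi.values())
    edges = [pair for pair, w in mi.items() if w >= threshold]
    return edges, mi

# Toy data: column 1 copies column 0; column 2 is independent of both.
columns = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
edges, mi = relevance_network(columns)
print(edges, mi)
```

Only the pair of identical columns survives the threshold here; their MI equals the entropy of a fair bit, $\log 2$.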

This method is quite popular in systems biology (Margolin et al. 2006), where it is used to 
visualize the interaction between genes. The trouble with biological examples is that they are 
hard for non-biologists to understand. So let us instead illustrate the idea using natural language 
text. Figure 26.1 gives an example, where we visualize the MI between words in the newsgroup 
dataset from Figure 1.2. The results seem intuitively reasonable. 

However, relevance networks suffer from a major problem: the graphs are usually very dense, 
since most variables are dependent on most other variables, even after thresholding the MIs. 
For example, suppose $X_1$ directly influences $X_2$ which directly influences $X_3$ (e.g., these form
components of a signalling cascade, $X_1 \to X_2 \to X_3$). Then $X_1$ has non-zero MI with $X_3$ (and
vice versa), so there will be a $1 - 3$ edge in the relevance network. Indeed, most pairs will be

Figure 26.2 A dependency network constructed from the 20-news data. We show all edges with regression
weight above 0.5 in the Markov blankets estimated by $\ell_1$ penalized logistic regression. Undirected
edges represent cases where a directed edge was found in both directions. From Figure 4.9 of (Schmidt 
2010). Used with kind permission of Mark Schmidt. 


connected. 

A better approach is to use graphical models, which represent conditional independence, 
rather than dependence. In the above example, $X_1$ is conditionally independent of $X_3$ given
$X_2$, so there will not be a $1 - 3$ edge. Consequently graphical models are usually much sparser
than relevance networks, and hence are a more useful way of visualizing interactions between 
multiple variables. 


Dependency networks 


A simple and efficient way to learn a graphical model structure is to independently fit D sparse 
full-conditional distributions $p(x_t|\mathbf{x}_{-t})$; this is called a dependency network (Heckerman et al.
2000). The chosen variables constitute the inputs to the node, i.e., its Markov blanket. We 
can then visualize the resulting sparse graph. The advantage over relevance networks is that 
redundant variables will not be selected as inputs. 

We can use any kind of sparse regression or classification method to fit each CPD. (Heckerman
et al. 2000) uses classification/regression trees, (Meinshausen and Buhlmann 2006) use
$\ell_1$-regularized linear regression, (Wainwright et al. 2006) use $\ell_1$-regularized logistic regression (see
depnetFit for some code), (Dobra 2009) uses Bayesian variable selection, etc. (Meinshausen
and Buhlmann 2006) discuss theoretical conditions under which $\ell_1$-regularized linear regression
can recover the true graph structure, assuming the data was generated from a sparse Gaussian
graphical model.

Figure 26.2 shows a dependency network that was learned from the 20-newsgroup data using
$\ell_1$-regularized logistic regression, where the penalty parameter $\lambda$ was chosen by BIC. Many
of the words present in these estimated Markov blankets represent fairly natural associations 
(aids:disease, baseball:fans, bible:god, bmw:car, cancer:patients, etc.). However, some of the esti- 
mated statistical dependencies seem less intuitive, such as baseball:windows and bmw:christian. 
We can gain more insight if we look not only at the sparsity pattern, but also the values of the 
regression weights. For example, here are the incoming weights for the first 5 words: 


• aids: children (0.53), disease (0.84), fact (0.47), health (0.77), president (0.50), research (0.53)

• baseball: christian (-0.98), drive (-0.49), games (0.81), god (-0.46), government (-0.69), hit (0.62),
memory (-1.29), players (1.16), season (0.31), software (-0.68), windows (-1.45)

• bible: car (-0.72), card (-0.88), christian (0.49), fact (0.21), god (1.01), jesus (0.68), orbit (0.83),
program (-0.56), religion (0.24), version (0.49)

• bmw: car (0.60), christian (-11.54), engine (0.69), god (-0.74), government (-1.01), help (-0.50),
windows (-1.43)

• cancer: disease (0.62), medicine (0.58), patients (0.90), research (0.49), studies (0.70)


Words in italic red have negative weights, which represents a dissociative relationship. For 
example, the model reflects that baseball:windows is an unlikely combination. It turns out that 
most of the weights are negative (1173 negative, 286 positive, 8541 zero) in this model. 

In addition to visualizing the data, a dependency network can be used for inference. However, 
the only algorithm we can use is Gibbs sampling, where we repeatedly sample the nodes with 
missing values from their full conditionals. Unfortunately, a product of full conditionals does 
not, in general, constitute a representation of any valid joint distribution (Heckerman et al. 
2000), so the output of the Gibbs sampler may not be meaningful. Nevertheless, the method can 
sometimes give reasonable results if there is not much missing data, and it is a useful method 
for data imputation (Gelman and Raghunathan 2001). In addition, the method can be used as 
an initialization technique for more complex structure learning methods that we discuss below. 


Learning tree structures 


For the rest of this chapter, we focus on learning fully specified joint probability models, which 
can be used for density estimation, prediction and knowledge discovery. 

Since the problem of structure learning for general graphs is NP-hard (Chickering 1996), we 
start by considering the special case of trees. Trees are special because we can learn their
structure efficiently, as we discuss below, and because, once we have learned the tree, we can
use it for efficient exact inference, as discussed in Section 20.2.

Figure 26.3 An undirected tree and two equivalent directed trees.


Directed or undirected tree? 


Before continuing, we need to discuss the issue of whether we should use directed or undirected 
trees. A directed tree, with a single root node r, defines a joint distribution as follows: 


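$$p(\mathbf{x}|T) = \prod_{t \in V} p(x_t|x_{\mathrm{pa}(t)}) \tag{26.1}$$

where we define $\mathrm{pa}(r) = \emptyset$. For example, in Figure 26.3(b-c), we have

$$p(x_1, x_2, x_3, x_4|T) = p(x_1)\, p(x_2|x_1)\, p(x_3|x_2)\, p(x_4|x_2) \tag{26.2}$$
$$= p(x_2)\, p(x_1|x_2)\, p(x_3|x_2)\, p(x_4|x_2) \tag{26.3}$$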


We see that the choice of root does not matter: both of these models are equivalent. 
To make the model more symmetric, it is preferable to use an undirected tree. This can be 
represented as follows: 


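$$p(\mathbf{x}|T) = \prod_{t \in V} p(x_t) \prod_{(s,t) \in E} \frac{p(x_s, x_t)}{p(x_s)\, p(x_t)} \tag{26.4}$$


where $p(x_s, x_t)$ is an edge marginal and $p(x_t)$ is a node marginal. For example, in Figure 26.3(a)
we have


$$p(x_1, x_2, x_3, x_4|T) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4)\, \frac{p(x_1, x_2)\, p(x_2, x_3)\, p(x_2, x_4)}{p(x_1) p(x_2)\, p(x_2) p(x_3)\, p(x_2) p(x_4)} \tag{26.5}$$

To see the equivalence with the directed representation, let us cancel terms to get

$$p(x_1, x_2, x_3, x_4|T) = p(x_1, x_2)\, \frac{p(x_2, x_3)}{p(x_2)}\, \frac{p(x_2, x_4)}{p(x_2)} \tag{26.6}$$
$$= p(x_1)\, p(x_2|x_1)\, p(x_3|x_2)\, p(x_4|x_2) \tag{26.7}$$
$$= p(x_2)\, p(x_1|x_2)\, p(x_3|x_2)\, p(x_4|x_2) \tag{26.8}$$


where $p(x_t|x_s) = p(x_t, x_s) / p(x_s)$.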

Thus a tree can be represented as either an undirected or directed graph: the number of 
parameters is the same, and hence the complexity of learning is the same. And of course, 
inference is the same in both representations, too. The undirected representation, which is 
symmetric, is useful for structure learning, but the directed representation is more convenient 
for parameter learning.


Chow-Liu algorithm for finding the ML tree structure


Using Equation 26.4, we can write the log-likelihood for a tree as follows: 


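$$\log p(\mathcal{D}|\boldsymbol{\theta}, T) = \sum_t \sum_k N_{tk} \log p(x_t = k|\boldsymbol{\theta}) + \sum_{s,t} \sum_{j,k} N_{stjk} \log \frac{p(x_s = j, x_t = k|\boldsymbol{\theta})}{p(x_s = j|\boldsymbol{\theta})\, p(x_t = k|\boldsymbol{\theta})} \tag{26.9}$$


where $N_{stjk}$ is the number of times node $s$ is in state $j$ and node $t$ is in state $k$, and $N_{tk}$ is
the number of times node $t$ is in state $k$. We can rewrite these counts in terms of the empirical
distribution: $N_{stjk} = N p_{\mathrm{emp}}(x_s = j, x_t = k)$ and $N_{tk} = N p_{\mathrm{emp}}(x_t = k)$. Setting $\boldsymbol{\theta}$ to the
MLEs, this becomes


$$\frac{\log p(\mathcal{D}|\hat{\boldsymbol{\theta}}, T)}{N} = \sum_{t \in V} \sum_k p_{\mathrm{emp}}(x_t = k) \log p_{\mathrm{emp}}(x_t = k) \tag{26.10}$$
$$\qquad + \sum_{(s,t) \in E(T)} I(x_s, x_t|\hat{\boldsymbol{\theta}}_{st}) \tag{26.11}$$


where $I(x_s, x_t|\hat{\boldsymbol{\theta}}_{st}) \geq 0$ is the mutual information between $x_s$ and $x_t$ given the empirical
distribution:


$$I(x_s, x_t|\hat{\boldsymbol{\theta}}_{st}) = \sum_j \sum_k p_{\mathrm{emp}}(x_s = j, x_t = k) \log \frac{p_{\mathrm{emp}}(x_s = j, x_t = k)}{p_{\mathrm{emp}}(x_s = j)\, p_{\mathrm{emp}}(x_t = k)} \tag{26.12}$$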


Since the first term in Equation 26.11 is independent of the topology T, we can ignore it when 
learning structure. Thus the tree topology that maximizes the likelihood can be found by 
computing the maximum weight spanning tree, where the edge weights are the pairwise mutual
informations, $I(x_s, x_t|\hat{\boldsymbol{\theta}}_{st})$. This is called the Chow-Liu algorithm (Chow and Liu 1968).

There are several algorithms for finding a max spanning tree (MST). The two best known are 
Prim’s algorithm and Kruskal’s algorithm. Both can be implemented to run in $O(E \log V)$ time,
where $E = V^2$ is the number of edges and $V$ is the number of nodes. See e.g., (Sedgewick and
Wayne 2011, 4.3) for details. Thus the overall running time is $O(N V^2 + V^2 \log V)$, where the
first term is the cost of computing the sufficient statistics.
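The whole procedure fits in a short sketch: compute the empirical pairwise mutual informations, then run Kruskal's algorithm with edges sorted by decreasing weight. The function names and the toy data are assumptions made for this illustration; it returns only the undirected edge set, not the fitted parameters.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical MI (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def chow_liu_edges(columns):
    """Maximum weight spanning tree over pairwise MI, via Kruskal with union-find."""
    V = len(columns)
    weighted = sorted(((mutual_information(columns[s], columns[t]), s, t)
                       for s, t in combinations(range(V), 2)), reverse=True)
    parent = list(range(V))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    tree = []
    for w, s, t in weighted:
        rs, rt = find(s), find(t)
        if rs != rt:                        # adding the edge creates no cycle
            parent[rs] = rt
            tree.append((s, t))
    return tree

# Toy chain x0 - x1 - x2: neighbors agree on 6 of 8 cases, x0 and x2 are independent.
columns = [[0, 0, 0, 0, 1, 1, 1, 1],
           [0, 0, 0, 1, 1, 1, 1, 0],
           [0, 0, 1, 1, 1, 1, 0, 0]]
tree = chow_liu_edges(columns)
print(sorted(tree))
```

On this toy data the recovered tree is the chain 0 - 1 - 2, since the 0 - 2 edge has zero empirical mutual information.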

Figure 26.4 gives an example of the method in action, applied to the binary 20 newsgroups 
data shown in Figure 1.2. The tree has been arbitrarily rooted at the node representing “email”. 
The connections that are learned seem intuitively reasonable. 


Finding the MAP forest 


Since all trees have the same number of parameters, we can safely use the maximum likelihood
score as a model selection criterion without worrying about overfitting. However, sometimes we 
may want to fit a forest rather than a single tree, since inference in a forest is much faster than 
in a tree (we can run belief propagation in each tree in the forest in parallel). The MLE criterion 
will never choose to omit an edge. However, if we use the marginal likelihood or a penalized 
likelihood (such as BIC), the optimal solution may be a forest. Below we give the details for the 
marginal likelihood case.

Figure 26.4 The MLE tree on the 20-newsgroup data. From Figure 4.11 of (Schmidt 2010). Used with kind
permission of Mark Schmidt. (A topologically equivalent tree can be produced using chowliuTreeDemo.) 


In Section 26.4.2.2, we explain how to compute the marginal likelihood of any DAG using a 
Dirichlet prior for the CPTs. The resulting expression can be written as follows: 


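$$\log p(\mathcal{D}|T) = \sum_{t \in V} \log \int \prod_{i=1}^{N} p(x_{it}|\mathbf{x}_{i,\mathrm{pa}(t)}, \boldsymbol{\theta}_t)\, p(\boldsymbol{\theta}_t)\, d\boldsymbol{\theta}_t = \sum_t \mathrm{score}(\mathbf{N}_{t,\mathrm{pa}(t)}) \tag{26.13}$$


where $\mathbf{N}_{t,\mathrm{pa}(t)}$ are the counts (sufficient statistics) for node $t$ and its parents, and score is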
defined in Equation 26.28. 

Now suppose we only allow DAGs with at most one parent. Following (Heckerman et al. 1995, 
p227), let us associate a weight with each $s \to t$ edge, $w_{s,t} \triangleq \mathrm{score}(t|s) - \mathrm{score}(t|0)$, where
$\mathrm{score}(t|0)$ is the score when $t$ has no parents. Note that the weights might be negative (unlike
the MLE case, where edge weights are always non-negative because they correspond to mutual
information). Then we can rewrite the objective as follows: 


$$\log p(\mathcal{D}|T) = \sum_t \mathrm{score}(t|\mathrm{pa}(t)) = \sum_t w_{\mathrm{pa}(t),t} + \sum_t \mathrm{score}(t|0) \tag{26.14}$$


The last term is the same for all trees T, so we can ignore it. Thus finding the most probable 
tree amounts to finding a maximal branching in the corresponding weighted directed graph. 
This can be found using the algorithm in (Gabow et al. 1984).

If the scoring function is prior and likelihood equivalent (these terms are explained in
Section 26.4.2.3), we have


score(s|t) + score(t|0) = score(t|s) + score(s|0) (26.15) 


and hence the weight matrix is symmetric. In this case, the maximal branching is the same 
as the maximal weight forest. We can apply a slightly modified version of the MST algorithm 
to find this (Edwards et al. 2010). To see this, let G = (V, E) be a graph with both positive 
and negative edge weights. Now let G’ be a graph obtained by omitting all the negative edges 
from G. This cannot reduce the total weight, so we can find the maximum weight forest of G 
by finding the MST for each connected component of G’. We can do this by running Kruskal’s 
algorithm directly on G’: there is no need to find the connected components explicitly. 
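The argument above translates directly into code: sort the edges by decreasing weight and run Kruskal's algorithm, stopping as soon as the weights are no longer positive. This is a minimal sketch with hypothetical names; it also drops zero-weight edges, which does not change the total weight.

```python
def max_weight_forest(num_nodes, weighted_edges):
    """Kruskal's algorithm restricted to positive-weight edges.

    weighted_edges: iterable of (weight, s, t) tuples for an undirected graph.
    Returns the chosen edges as (s, t, weight) tuples.
    """
    parent = list(range(num_nodes))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    forest = []
    for w, s, t in sorted(weighted_edges, reverse=True):
        if w <= 0:
            break                           # remaining edges cannot increase the total
        rs, rt = find(s), find(t)
        if rs != rt:
            parent[rs] = rt
            forest.append((s, t, w))
    return forest

edges = [(2.0, 0, 1), (1.0, 1, 2), (0.5, 0, 2), (-0.3, 2, 3)]
forest = max_weight_forest(4, edges)
print(forest)
```

In this toy example node 3 is left isolated, because its only edge has negative weight, so the result is a forest with two components rather than a spanning tree.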


Mixtures of trees 


A single tree is rather limited in its expressive power. Later in this chapter we discuss ways to 
learn more general graphs. However, the resulting graphs can be expensive to do inference in. 
An interesting alternative is to learn a mixture of trees (Meila and Jordan 2000), where each 
mixture component may have a different tree topology. This is like an unsupervised version of 
the TAN classifier discussed in Section 10.2.1. We can fit a mixture of trees by using EM: in the 
E step, we compute the responsibilities of each cluster for each data point, and in the M step, 
we use a weighted version of the Chow-Liu algorithm. See (Meila and Jordan 2000) for details. 
In fact, it is possible to create an “infinite mixture of trees”, by integrating out over all possible 
trees. Remarkably, this can be done in $V^3$ time using the matrix tree theorem. This allows us to
perform exact Bayesian inference of posterior edge marginals etc. However, it is not tractable to 
use this infinite mixture for inference of hidden nodes. See (Meila and Jaakkola 2006) for details. 
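The key computational fact is the matrix tree theorem: the sum, over all spanning trees, of the product of edge weights equals the determinant of the graph Laplacian with one row and column removed. The sketch below illustrates only this identity for an undirected weighted graph, using exact rational arithmetic; it is not the inference procedure of (Meila and Jaakkola 2006), and the function names are hypothetical.

```python
from fractions import Fraction

def determinant(M):
    """Determinant by Gaussian elimination with exact rational arithmetic, O(n^3)."""
    A = [[Fraction(x) for x in row] for row in M]
    n = len(A)
    det = Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if A[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            det = -det                      # a row swap flips the sign
        det *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return det

def sum_over_spanning_trees(W):
    """W is a symmetric weight matrix with zero diagonal."""
    V = len(W)
    L = [[(sum(W[s]) if s == t else -W[s][t]) for t in range(V)] for s in range(V)]
    reduced = [row[1:] for row in L[1:]]    # delete the first row and column
    return determinant(reduced)

complete4 = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]
triangle = [[0, 2, 3], [2, 0, 5], [3, 5, 0]]
print(sum_over_spanning_trees(complete4), sum_over_spanning_trees(triangle))
```

With unit weights the result counts spanning trees (16 for the complete graph on 4 nodes); for the weighted triangle it equals $2 \cdot 3 + 2 \cdot 5 + 3 \cdot 5 = 31$, the sum over its three spanning trees.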


Learning DAG structures 


In this section, we discuss how to compute (functions of) p(G|D), where G is constrained to be 
a DAG. This is often called Bayesian network structure learning. In this section, we assume 
there is no missing data, and that there are no hidden variables. This is called the complete 
data assumption. For simplicity, we will focus on the case where all the variables are categorical 
and all the CPDs are tables, although the results generalize to real-valued data and other kinds 
of CPDs, such as linear-Gaussian CPDs. 

Our presentation is based in part on (Heckerman et al. 1995), although we will follow the 
notation of Section 10.4.2. In particular, let $x_{it} \in \{1,\ldots,K_t\}$ be the value of node $t$ in case $i$,
where $K_t$ is the number of states for node $t$. Let $\theta_{tck} = p(x_t = k|\mathbf{x}_{\mathrm{pa}(t)} = c)$, for $k = 1:K_t$,
and $c = 1:C_t$, where $C_t$ is the number of parent combinations (possible conditioning cases).
For notational simplicity, we will often assume $K_t = K$, so all nodes have the same number of
states. We will also let $d_t = \dim(\mathrm{pa}(t))$ be the degree or fan-in of node $t$, so that $C_t = K^{d_t}$.


Markov equivalence 


In this section, we discuss some fundamental limits to our ability to learn DAG structures from 
data.

Figure 26.5 Three DAGs. $G_1$ and $G_3$ are Markov equivalent, $G_2$ is not.


Consider the following 3 DGMs: $X \to Y \to Z$, $X \leftarrow Y \leftarrow Z$ and $X \leftarrow Y \to Z$. These all
represent the same set of CI statements, namely


$$X \perp Z | Y, \quad X \not\perp Z \tag{26.16}$$


We say these graphs are Markov equivalent, since they encode the same set of CI assumptions.
That is, they all belong to the same Markov equivalence class. However, the v-structure
$X \to Y \leftarrow Z$ encodes $X \perp Z$ and $X \not\perp Z | Y$, which represents the opposite set of CI
assumptions.

One can prove the following theorem. 


Theorem 26.4.1 (Verma and Pearl 1990). Two structures are Markov equivalent
iff they have the same undirected skeleton and the same set of v-structures. 


For example, referring to Figure 26.5, we see that $G_1 \not\equiv G_2$, since reversing the $2 \to 4$ arc
creates a new v-structure. However, $G_1 \equiv G_3$, since reversing the $1 \to 5$ arc does not create a
new v-structure. 
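The theorem gives a simple test for Markov equivalence, sketched below. A DAG is represented as a list of directed `(parent, child)` pairs; this representation and the function names are assumptions of the sketch. The example uses the three-node graphs discussed earlier, since the edge lists of Figure 26.5 are not reproduced in the text.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the graph: a set of unordered node pairs."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """All triples a -> child <- b in which a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for s, t in edges:
        parents.setdefault(t, set()).add(s)
    found = set()
    for child, pas in parents.items():
        for a, b in combinations(sorted(pas), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found

def markov_equivalent(edges1, edges2):
    """Same skeleton and same v-structures."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))

chain = [("X", "Y"), ("Y", "Z")]        # X -> Y -> Z
reverse = [("Z", "Y"), ("Y", "X")]      # X <- Y <- Z
fork = [("Y", "X"), ("Y", "Z")]         # X <- Y -> Z
collider = [("X", "Y"), ("Z", "Y")]     # X -> Y <- Z

print(markov_equivalent(chain, reverse), markov_equivalent(chain, fork),
      markov_equivalent(chain, collider))
```

The chain, its reversal, and the fork share a skeleton and have no v-structures, so they are equivalent; the collider has the same skeleton but one v-structure, so it is not.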

We can represent a Markov equivalence class using a single partially directed acyclic graph 
(PDAG), also called an essential graph or pattern, in which some edges are directed and some 
undirected. The undirected edges represent reversible edges; any combination is possible so 
long as no new v-structures are created. The directed edges are called compelled edges, since 
changing their orientation would change the v-structures and hence change the equivalence 
class. For example, the PDAG $X - Y - Z$ represents $\{X \to Y \to Z,\ X \leftarrow Y \leftarrow Z,\ X \leftarrow Y \to Z\}$,
which encodes $X \not\perp Z$ and $X \perp Z | Y$. See Figure 26.6.

The significance of the above theorem is that, when we learn the DAG structure from data, 
we will not be able to uniquely identify all of the edge directions, even given an infinite amount 
of data. We say that we can learn DAG structure “up to Markov equivalence”. This also cautions 
us not to read too much into the meaning of particular edge orientations, since we can often 
change them without changing the model in any observable way.

Figure 26.6 PDAG representation of Markov equivalent DAGs.


Exact structural inference 

In this section, we discuss how to compute the exact posterior over graphs, p(G|D), ignoring 
for now the issue of computational tractability. 

Deriving the likelihood 


Assuming there is no missing data, and that all CPDs are tabular, the likelihood can be written 
as follows: 


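$$p(\mathcal{D}|G, \boldsymbol{\theta}) = \prod_{i=1}^{N} \prod_{t=1}^{V} \mathrm{Cat}(x_{it}|\mathbf{x}_{i,\mathrm{pa}(t)}, \boldsymbol{\theta}_t) \tag{26.17}$$
$$= \prod_{i=1}^{N} \prod_{t=1}^{V} \prod_{c=1}^{C_t} \mathrm{Cat}(x_{it}|\boldsymbol{\theta}_{tc})^{\mathbb{I}(\mathbf{x}_{i,\mathrm{pa}(t)} = c)} \tag{26.18}$$
$$= \prod_{i=1}^{N} \prod_{t=1}^{V} \prod_{c=1}^{C_t} \prod_{k=1}^{K_t} \theta_{tck}^{\mathbb{I}(x_{it} = k,\ \mathbf{x}_{i,\mathrm{pa}(t)} = c)} \tag{26.19}$$
$$= \prod_{t=1}^{V} \prod_{c=1}^{C_t} \prod_{k=1}^{K_t} \theta_{tck}^{N_{tck}} \tag{26.20}$$


where $N_{tck}$ is the number of times node $t$ is in state $k$ and its parents are in state $c$. (Technically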
these counts depend on the graph structure G, but we drop this from the notation.) 
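
The counts $N_{tck}$ are simple to accumulate. The following is a minimal Python sketch, assuming the data is a list of tuples with one column per node; the function name and the toy data are hypothetical and only serve to illustrate the bookkeeping.

```python
from collections import Counter

def family_counts(data, t, parents):
    """Return a Counter mapping (c, k) -> N_tck, where c is the tuple of
    parent values and k is the value of node t, over all data cases."""
    counts = Counter()
    for row in data:
        c = tuple(row[p] for p in parents)
        counts[(c, row[t])] += 1
    return counts

# Hypothetical toy data: 4 cases over two binary nodes (columns 0 and 1).
data = [(1, 1), (1, 2), (2, 2), (1, 1)]
counts = family_counts(data, t=1, parents=[0])
print(dict(counts))
```

Changing the `parents` argument changes the counts, which is the dependence on $G$ that the notation suppresses.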


Deriving the marginal likelihood 


Of course, choosing the graph with the maximum likelihood will always pick a fully connected 
graph (subject to the acyclicity constraint), since this maximizes the number of parameters. To 
avoid such overfitting, we will choose the graph with the maximum marginal likelihood, p(D|G); 
the magic of the Bayesian Occam’s razor will then penalize overly complex graphs. 

To compute the marginal likelihood, we need to specify priors on the parameters. We will 
make two standard assumptions. First, we assume global prior parameter independence, 
which means 


$$p(\theta) = \prod_{t=1}^{V} p(\theta_t) \tag{26.21}$$

Second, we assume local prior parameter independence, which means 

$$p(\theta_t) = \prod_{c=1}^{C_t} p(\theta_{tc}) \tag{26.22}$$

for each $t$. It turns out that these assumptions imply that the prior for each row of each CPT 
must be a Dirichlet (Geiger and Heckerman 1997), that is, 

$$p(\theta_{tc}) = \mathrm{Dir}(\theta_{tc} \mid \boldsymbol{\alpha}_{tc}) \tag{26.23}$$


Given these assumptions, and using the results of Section 5.3.2.2, we can write down the 
marginal likelihood of any DAG as follows: 


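$$
\begin{aligned}
p(D|G) &= \prod_{t=1}^{V} \prod_{c=1}^{C_t} \int \Big[ \prod_{i:\, x_{i,\mathrm{pa}(t)} = c} \mathrm{Cat}(x_{it} \mid \theta_{tc}) \Big] \mathrm{Dir}(\theta_{tc} \mid \boldsymbol{\alpha}_{tc})\, d\theta_{tc} && (26.24) \\
&= \prod_{t=1}^{V} \prod_{c=1}^{C_t} \frac{B(\mathbf{N}_{tc} + \boldsymbol{\alpha}_{tc})}{B(\boldsymbol{\alpha}_{tc})} && (26.25) \\
&= \prod_{t=1}^{V} \prod_{c=1}^{C_t} \frac{\Gamma(\alpha_{tc})}{\Gamma(N_{tc} + \alpha_{tc})} \prod_{k=1}^{K_t} \frac{\Gamma(N_{tck} + \alpha_{tck})}{\Gamma(\alpha_{tck})} && (26.26) \\
&= \prod_{t=1}^{V} \mathrm{score}(\mathbf{N}_{t,\mathrm{pa}(t)}) && (26.27)
\end{aligned}
$$

where $N_{tc} = \sum_k N_{tck}$, $\alpha_{tc} = \sum_k \alpha_{tck}$, $\mathbf{N}_{tc}$ and $\boldsymbol{\alpha}_{tc}$ are the 
corresponding length-$K_t$ vectors, $\mathbf{N}_{t,\mathrm{pa}(t)}$ is the vector of counts (sufficient statistics) for 
node $t$ and its parents, and score() is a local scoring function defined by 

$$\mathrm{score}(\mathbf{N}_{t,\mathrm{pa}(t)}) \triangleq \prod_{c=1}^{C_t} \frac{B(\mathbf{N}_{tc} + \boldsymbol{\alpha}_{tc})}{B(\boldsymbol{\alpha}_{tc})} \tag{26.28}$$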


We say that the marginal likelihood decomposes or factorizes according to the graph structure. 
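
In practice the products of Beta functions are evaluated in log space. The sketch below is one way to do this in Python using only `math.lgamma`; the function names and the nested-list layout (one row of counts and one row of pseudo counts per parent configuration) are assumptions made for illustration.

```python
from math import lgamma

def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))

def log_local_score(counts, alphas):
    """Log of the local score for one node.

    counts[c] and alphas[c] are the length-K_t vectors N_tc and alpha_tc
    for parent configuration c."""
    total = 0.0
    for n_c, a_c in zip(counts, alphas):
        posterior = [n + a for n, a in zip(n_c, a_c)]
        total += log_beta(posterior) - log_beta(a_c)
    return total

def log_marginal_likelihood(families):
    """Sum of local log scores; families is a list of (counts, alphas) pairs,
    one per node, reflecting the factorization over the graph."""
    return sum(log_local_score(n, a) for n, a in families)
```

Because the total is a sum of per-node terms, changing the parents of one node only requires recomputing that node's term.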


Setting the prior 


How should we set the hyper-parameters $\alpha_{tck}$? It is tempting to use a Jeffreys prior of the form 
$\alpha_{tck} = \frac{1}{2}$ (Equation 5.62). However, it turns out that this violates a property called likelihood 
equivalence, which is sometimes considered desirable. This property says that if $G_1$ and $G_2$ are 
Markov equivalent (Section 26.4.1), they should have the same marginal likelihood, since they are 
essentially equivalent models. Geiger and Heckerman (1997) proved that, for complete graphs, 
the only prior that satisfies likelihood equivalence and parameter independence is the Dirichlet 
prior, where the pseudo counts have the form 

$$\alpha_{tck} = \alpha\, p_0(x_t = k, x_{\mathrm{pa}(t)} = c) \tag{26.29}$$

where $\alpha > 0$ is called the equivalent sample size, and $p_0$ is some prior joint probability 
distribution. This is called the BDe prior, which stands for Bayesian Dirichlet likelihood equivalent. 
To derive the hyper-parameters for other graph structures, Geiger and Heckerman (1997) 
invoked an additional assumption called parameter modularity, which says that if node $X_t$ 
has the same parents in $G_1$ and $G_2$, then $p(\theta_t|G_1) = p(\theta_t|G_2)$. With this assumption, we 
can always derive $\boldsymbol{\alpha}_t$ for a node $t$ in any other graph by marginalizing the pseudo counts in 
Equation 26.29. 

Typically the prior distribution po is assumed to be uniform over all possible joint configura- 
tions. In this case, we have 


$$\alpha_{tck} = \frac{\alpha}{K_t C_t} \tag{26.30}$$

since $p_0(x_t = k, x_{\mathrm{pa}(t)} = c) = \frac{1}{K_t C_t}$. Thus if we sum the pseudo counts over all $C_t \times K_t$ 
entries in the CPT, we get a total equivalent sample size of $\alpha$. This is called the BDeu prior, 
where the “u” stands for uniform. This is the most widely used prior for learning Bayes net 
structures. For advice on setting the global tuning parameter $\alpha$, see (Silander et al. 2007). 


Simple worked example 


We now give a very simple worked example from (Neapolitan 2003, p.438). Suppose we have 
just 2 binary nodes, and the following 8 data cases: 


X1 X2 
1 1 
1 2 
1 1 
2 2 
1 1 
2 1 
1 1 
2 2 


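Suppose we are interested in two possible graphs: $G_1$ is $X_1 \rightarrow X_2$ and $G_2$ is the disconnected 
graph. The empirical counts for node 1 in $G_1$ are $\mathbf{N}_1 = (5,3)$ and for node 2 are 

         X2=1   X2=2
  X1=1    4      1
  X1=2    1      2

The BDeu prior for $G_1$ is $\boldsymbol{\alpha}_1 = (\alpha/2, \alpha/2)$, $\boldsymbol{\alpha}_{2|x_1=1} = (\alpha/4, \alpha/4)$ and 
$\boldsymbol{\alpha}_{2|x_1=2} = (\alpha/4, \alpha/4)$. For $G_2$, the prior for $\theta_1$ is the same, and for $\theta_2$ it is 
$\boldsymbol{\alpha}_{2|x_1=1} = (\alpha/2, \alpha/2)$ and $\boldsymbol{\alpha}_{2|x_1=2} = (\alpha/2, \alpha/2)$; since $X_2$ has no parents in $G_2$, 
this is a single Dirichlet applied to the pooled counts $\mathbf{N}_2 = (5,3)$. If we set $\alpha = 4$, and use 
the BDeu prior, we find $p(D|G_1) = 7.2150 \times 10^{-6}$ and $p(D|G_2) = 6.7465 \times 10^{-6}$. Hence the 
posterior probabilities, under a uniform graph prior, are $p(G_1|D) = 0.51678$ and $p(G_2|D) = 0.48322$. 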
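
The numbers in this example can be reproduced with a few lines of Python. This is a sketch under the assumptions stated in the text ($\alpha = 4$, BDeu pseudo counts, uniform prior over the two graphs); the helper names are hypothetical.

```python
from math import lgamma, exp

def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))

def log_score(counts, alphas):
    """Log local score: sum over parent configurations of log B(N+alpha)/B(alpha)."""
    return sum(log_beta([n + a for n, a in zip(nc, ac)]) - log_beta(ac)
               for nc, ac in zip(counts, alphas))

alpha = 4.0
# Node 1 has no parents in either graph: N_1 = (5, 3), prior (alpha/2, alpha/2).
node1 = log_score([[5, 3]], [[alpha / 2, alpha / 2]])
# G1: X1 -> X2. One row of counts per value of X1, each with prior (alpha/4, alpha/4).
node2_g1 = log_score([[4, 1], [1, 2]], [[alpha / 4] * 2, [alpha / 4] * 2])
# G2: X2 has no parents, so its counts are pooled: N_2 = (5, 3).
node2_g2 = log_score([[5, 3]], [[alpha / 2, alpha / 2]])

ml_g1 = exp(node1 + node2_g1)
ml_g2 = exp(node1 + node2_g2)
post_g1 = ml_g1 / (ml_g1 + ml_g2)   # uniform prior over {G1, G2}
print(f"p(D|G1) = {ml_g1:.4e}, p(D|G2) = {ml_g2:.4e}, p(G1|D) = {post_g1:.5f}")
```

With only 8 data cases the evidence for the edge is weak, which is why the two posterior probabilities are close to one half.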


Example: analysis of the college plans dataset 


We now consider a more interesting example from (Heckerman et al. 1997). Consider the data 
set collected in 1968 by Sewell and Shah which measured 5 variables that might influence the 
decision of high school students about whether to attend college. Specifically, the variables are 
as follows: 
Figure 26.7 The two most probable DAGs learned from the Sewell-Shah data. Source: (Heckerman et al. 
1997). Used with kind permission of David Heckerman. 


• Sex: male or female. 

• SES: socio-economic status: low, lower middle, upper middle or high. 

• IQ: intelligence quotient: discretized into low, lower middle, upper middle or high. 

• PE: parental encouragement: low or high. 

• CP: college plans: yes or no. 


These variables were measured for 10,318 Wisconsin high school seniors. There are 
$2 \times 4 \times 4 \times 2 \times 2 = 128$ possible joint configurations. 

Heckerman et al. computed the exact posterior over all 29,281 possible 5 node DAGs, except 
for ones in which SEX and/or SES have parents, and/or CP has children. (The prior probability 
of these graphs was set to 0, based on domain knowledge.) They used the BDeu score with 
$\alpha = 5$, although they said that the results were robust to any $\alpha$ in the range 3 to 40. The top 
two graphs are shown in Figure 26.7. We see that the most probable one has nearly all 
of the probability mass, so the posterior is extremely peaked. 

It is tempting to interpret this graph in terms of causality (see Section 26.6). In particular, 
it seems that socio-economic status, IQ and parental encouragement all causally influence the 
decision about whether to go to college, which makes sense. Also, sex influences college plans 
only indirectly through parental encouragement, which also makes sense. However, the direct 
link from socio-economic status to IQ seems surprising; this may be due to a hidden common 
cause. In Section 26.5.1.4 we will re-examine this dataset allowing for the presence of hidden 
variables. 


The K2 algorithm 


Suppose we know a total ordering of the nodes. Then we can compute the distribution over 
parents for each node independently, without the risk of introducing any directed cycles: we 
simply enumerate over all possible subsets of ancestors and compute their marginal likelihoods.¹ 
If we just return the best set of parents for each node, we get the K2 algorithm (Cooper 
and Herskovits 1992). 
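
The enumeration just described can be sketched as follows. This is an illustrative Python version, not the original K2 code: it assumes states are coded 0, 1, ..., uses the BDeu local score, and caps the parent set size with a hypothetical `max_parents` argument to keep the enumeration small.

```python
from itertools import combinations
from math import lgamma

def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))

def bdeu_local_score(data, t, parents, arity, alpha=1.0):
    """Log BDeu local score of node t with the given tuple of parents."""
    k_t = arity[t]
    c_t = 1
    for p in parents:
        c_t *= arity[p]
    a = alpha / (k_t * c_t)                      # BDeu pseudo count per CPT entry
    counts = {}
    for row in data:
        c = tuple(row[p] for p in parents)
        counts.setdefault(c, [0] * k_t)[row[t]] += 1
    # Parent configurations never seen in the data contribute log 1 = 0.
    return sum(log_beta([n + a for n in n_c]) - log_beta([a] * k_t)
               for n_c in counts.values())

def best_parents(data, order, arity, max_parents=2, alpha=1.0):
    """For each node, pick the highest-scoring subset of its predecessors in `order`."""
    result = {}
    for pos, t in enumerate(order):
        preds = order[:pos]
        candidates = [ps for r in range(min(max_parents, len(preds)) + 1)
                      for ps in combinations(preds, r)]
        result[t] = max(candidates,
                        key=lambda ps: bdeu_local_score(data, t, ps, arity, alpha))
    return result

# Hypothetical data in which node 1 is an exact copy of node 0.
toy = [(0, 0), (1, 1)] * 10
print(best_parents(toy, order=[0, 1], arity=[2, 2]))
```

Since every candidate parent set comes from earlier positions in the ordering, the resulting graph is acyclic by construction.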


Handling non-tabular CPDs 


If all CPDs are linear Gaussian, we can replace the Dirichlet-multinomial model with the normal- 
gamma model, and thus derive a different exact expression for the marginal likelihood. See 
(Geiger and Heckerman 1994) for the details. In fact, we can easily combine discrete nodes 
and Gaussian nodes, as long as the discrete nodes always have discrete parents; this is called a 
conditional Gaussian DAG. Again, we can compute the marginal likelihood in closed form. See 
(Bottcher and Dethlefsen 2003) for the details. 

In the general case (i.e., everything except Gaussians and CPTs), we need to approximate the 
marginal likelihood. The simplest approach is to use the BIC approximation, which has the form 


$$\sum_t \Big[ \log p(D_t \mid \hat{\theta}_t) - \frac{K_t C_t}{2} \log N \Big] \tag{26.31}$$
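
As a rough sketch of how one term of Equation 26.31 is evaluated, the Python below scores a single node. It uses a tabular CPD purely because its ML estimate is available in closed form (the text reserves BIC for models without a closed-form marginal likelihood), and it uses the $K_t C_t$ parameter count exactly as it appears in the equation; the function name is hypothetical.

```python
from math import log

def bic_local_score(counts):
    """BIC-style score for one node.

    counts[c][k] = N_tck. Uses the ML estimate theta_tck = N_tck / N_tc and
    the K_t * C_t parameter count that appears in Equation 26.31."""
    n_total = sum(sum(row) for row in counts)
    loglik = 0.0
    for row in counts:
        n_c = sum(row)
        for n_ck in row:
            if n_ck > 0:                      # 0 * log 0 is taken to be 0
                loglik += n_ck * log(n_ck / n_c)
    k_t, c_t = len(counts[0]), len(counts)
    return loglik - 0.5 * k_t * c_t * log(n_total)

print(bic_local_score([[4, 1], [1, 2]]))
```

Summing this quantity over nodes gives a decomposable score that can be dropped into the same search procedures as the exact marginal likelihood.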


Scaling up to larger graphs 


The main challenge in computing the posterior over DAGs is that there are so many possible 
graphs. More precisely, (Robinson 1973) showed that the number of DAGs on D nodes satisfies 
the following recurrence: 


$$f(D) = \sum_{i=1}^{D} (-1)^{i+1} \binom{D}{i} 2^{i(D-i)} f(D-i) \tag{26.32}$$

for $D \geq 2$. The base cases are $f(0) = f(1) = 1$. Solving this recurrence yields the following sequence: 
1, 3, 25, 543, 29281, 3781503, etc.² In view of the enormous size of the hypothesis space, we are 
generally forced to use approximate methods, some of which we review below. 
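
The recurrence is easy to evaluate directly. The following Python sketch memoizes it, taking $f(0) = 1$ as the starting value (an assumption needed for the $i = D$ term); the function name is hypothetical.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(d):
    """Number of labeled DAGs on d nodes via Robinson's recurrence, with f(0) = 1."""
    if d == 0:
        return 1
    return sum((-1) ** (i + 1) * comb(d, i) * 2 ** (i * (d - i)) * num_dags(d - i)
               for i in range(1, d + 1))

print([num_dags(d) for d in range(1, 7)])
```

The value for 5 nodes, 29281, is the number of graphs Heckerman et al. had to consider before applying their domain constraints.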


Approximating the mode of the posterior 


We can use dynamic programming to find the globally optimal MAP DAG (up to Markov 
equivalence) (Koivisto and Sood 2004; Silander and Myllymäki 2006). Unfortunately this method takes 
$V 2^V$ time and space, making it intractable beyond about 16 nodes. Indeed, the general problem 
of finding the globally optimal MAP DAG is provably NP-complete (Chickering 1996). 

Consequently, we must settle for finding a locally optimal MAP DAG. The most common 
method is greedy hill climbing: at each step, the algorithm proposes small changes to the 
current graph, such as adding, deleting or reversing a single edge; it then moves to the 
neighboring graph which most increases the posterior. The method stops when it reaches a 
local maximum. It is important that the method only proposes local changes to the graph, 
since this enables the change in marginal likelihood (and hence the posterior) to be computed 
in constant time (assuming we cache the sufficient statistics). This is because all but one 
or two of the terms in Equation 26.25 will cancel out when computing the log Bayes factor 
$\delta(G \rightarrow G') = \log p(G'|D) - \log p(G|D)$. 


1. We can make this method more efficient by using $\ell_1$-regularization to select the parents (Schmidt et al. 2007). In this 
case, we need to approximate the marginal likelihood as we discuss below. 

2. A longer list of values can be found at http://www.research.att.com/~njas/sequences/A003024. 
Interestingly, the number of DAGs is equal to the number of (0,1) matrices all of whose eigenvalues are positive real numbers 
(McKay et al. 2004). 

Figure 26.8 A locally optimal DAG learned from the 20-newsgroup data. From Figure 4.10 of (Schmidt 
2010). Used with kind permission of Mark Schmidt. 

We can initialize the search from the best tree, which can be found using exact methods 
discussed in Section 26.3. For speed, we can restrict the search so it only adds edges which are 
part of the Markov blankets estimated from a dependency network (Schmidt 2010). Figure 26.8 
gives an example of a DAG learned in this way from the 20-newsgroup data. 

We can use techniques such as multiple random restarts to increase the chance of finding a 
good local maximum. We can also use more sophisticated local search methods, such as genetic 
algorithms or simulated annealing, for structure learning. 
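
A minimal sketch of greedy hill climbing over DAGs is given below. It is written in Python for illustration and makes several simplifying assumptions: the graph is a dict mapping each node to its parent set, the caller supplies a decomposable `local_score(node, parents)` function (hypothetical), and for clarity the total score is recomputed for every neighbor rather than updating only the one or two families that change.

```python
def is_acyclic(parents):
    """parents maps each node to a set of parents; repeatedly strip root nodes."""
    remaining = {v: set(ps) for v, ps in parents.items()}
    while remaining:
        roots = [v for v, ps in remaining.items() if not ps]
        if not roots:
            return False                       # every remaining node has a parent: cycle
        for r in roots:
            del remaining[r]
        for ps in remaining.values():
            ps.difference_update(roots)
    return True

def neighbors(parents):
    """Yield graphs that differ by one edge addition, deletion, or reversal."""
    nodes = list(parents)
    for u in nodes:
        for v in nodes:
            if u == v:
                continue
            new = {w: set(ps) for w, ps in parents.items()}
            if u in parents[v]:
                new[v].discard(u)              # delete u -> v
                yield new
                rev = {w: set(ps) for w, ps in new.items()}
                rev[u].add(v)                  # reverse to v -> u
                if is_acyclic(rev):
                    yield rev
            elif v not in parents[u]:
                new[v].add(u)                  # add u -> v
                if is_acyclic(new):
                    yield new

def hill_climb(nodes, local_score, max_steps=100):
    """Start from the empty graph and move to the best neighbor until none improves."""
    parents = {v: set() for v in nodes}

    def total(g):
        return sum(local_score(v, frozenset(ps)) for v, ps in g.items())

    current = total(parents)
    for _ in range(max_steps):
        best, best_score = None, current
        for g in neighbors(parents):
            s = total(g)
            if s > best_score + 1e-12:
                best, best_score = g, s
        if best is None:
            break                              # local maximum reached
        parents, current = best, best_score
    return parents, current
```

A practical implementation would cache the local scores and evaluate each move by the difference in the affected families only, which is what makes each proposal cheap. Random restarts can be added by calling `hill_climb` from different initial graphs.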


Approximating other functions of the posterior 


If our goal is knowledge discovery, the MAP DAG can be misleading, for reasons we discussed in 
Section 5.2.1. A better approach is to compute the probability that each edge is present, 
$p(G_{st} = 1|D)$, or the probability there is a path from $s$ to $t$. We can do this exactly using dynamic 
programming (Koivisto 2006; Parviainen and Koivisto 2011). Unfortunately these methods take 
$V 2^V$ time in the general case, making them intractable for graphs with more than about 16 
nodes. 

An approximate method is to sample DAGs from the posterior, and then to compute the 
fraction of times there is an $s \rightarrow t$ edge or path for each $(s,t)$ pair. The standard way to draw 
samples is to use the Metropolis Hastings algorithm (Section 24.3), where we use the same local 
proposal as we did in greedy search (Madigan and Raftery 1994). 
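
Once samples are available, the edge and path frequencies are simple to tabulate. The Python sketch below assumes each sampled DAG is represented as a set of directed (parent, child) pairs; the sampler itself is not shown, and the four "samples" are made up purely to illustrate the computation.

```python
def has_path(edges, s, t):
    """Depth-first search for a directed path from s to t."""
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
    stack, seen = [s], set()
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            if v == t:
                return True
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False

def edge_and_path_freq(samples, s, t):
    """Fraction of sampled graphs containing the edge s -> t, and a path from s to t."""
    n = len(samples)
    edge = sum((s, t) in g for g in samples) / n
    path = sum(has_path(g, s, t) for g in samples) / n
    return edge, path

# Hypothetical samples over nodes {1, 2, 3}.
samples = [{(1, 2), (2, 3)}, {(1, 3)}, {(2, 3)}, set()]
print(edge_and_path_freq(samples, 1, 3))
```

Path frequencies are at least as large as edge frequencies, since a direct edge is itself a path.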

A faster-mixing method is to use a collapsed MH sampler, as suggested in (Friedman and 
Koller 2003). This exploits the fact that, if a total ordering of the nodes is known, we can 
select the parents for each node independently, without worrying about cycles, as discussed in 
Section 26.4.2.6. By summing over all possible choices of parents, we can marginalize out this 
part of the problem, and just sample total orders. (Ellis and Wong 2008) also use order-space 
(collapsed) MCMC, but this time with a parallel tempering MCMC algorithm. 


Learning DAG structure with latent variables 


Sometimes the complete data assumption does not hold, either because we have missing data, 
and/or because we have hidden variables. In this case, the marginal likelihood is given by 

$$p(D|G) = \int \sum_h p(D, h \mid \theta, G)\, p(\theta|G)\, d\theta = \sum_h \int p(D, h \mid \theta, G)\, p(\theta|G)\, d\theta \tag{26.33}$$


where h represents the hidden or missing data. 

In general this is intractable to compute. For example, consider a mixture model, where 
we don’t observe the cluster label. In this case, there are $K^N$ possible completions of the 
data (assuming we have $K$ clusters and $N$ data cases); we can evaluate the inner integral for each one of these 
assignments to $h$, but we cannot afford to evaluate all of the integrals. (Of course, most of these 
integrals will correspond to hypotheses with little posterior support, such as assigning single 
data points to isolated clusters, but we don’t know ahead of time the relative weight of these 
assignments.) 

In this section, we discuss some ways for learning DAG structure when we have latent variables 
and/or missing data. 


Approximating the marginal likelihood when we have missing data 


The simplest approach is to use standard structure learning methods for fully visible DAGs, 
but to approximate the marginal likelihood. In Section 24.7, we discussed some Monte Carlo 
methods for approximating the marginal likelihood. However, these are usually too slow to use 
inside of a search over models. Below we mention some faster deterministic approximations. 


BIC approximation 


A simple approximation is to use the BIC score, which is given by 


$$\mathrm{BIC}(G) \triangleq \log p(D \mid \hat{\theta}, G) - \frac{\log N}{2} \dim(G) \tag{26.34}$$

where dim(G) is the number of degrees of freedom in the model and $\hat{\theta}$ is the MAP or ML 
estimate. However, the BIC score often severely underestimates the true marginal likelihood 
(Chickering and Heckerman 1997), resulting in it selecting overly simple models. 
Cheeseman-Stutz approximation 


We now present a better method known as the Cheeseman-Stutz approximation (CS) 
(Cheeseman and Stutz 1996). We first compute a MAP estimate of the parameters $\hat{\theta}$ (e.g., using EM). 
Denote the expected sufficient statistics of the data by $\overline{D} = \overline{D}(\hat{\theta})$; in the case of discrete 
variables, we just “fill in” the hidden variables with their expectation. We then use the exact 
marginal likelihood equation on this filled-in data: 

$$p(D|G) \approx p(\overline{D}|G) = \int p(\overline{D} \mid \theta, G)\, p(\theta|G)\, d\theta \tag{26.35}$$

However, comparing this to Equation 26.33, we can see that the value will be exponentially 
smaller, since it does not sum over all values of $h$. To correct for this, we first write 

$$\log p(D|G) = \log p(\overline{D}|G) + \log p(D|G) - \log p(\overline{D}|G) \tag{26.36}$$

and then we apply a BIC approximation to the last two terms: 

$$
\begin{aligned}
\log p(D|G) - \log p(\overline{D}|G) &\approx \Big[ \log p(D \mid \hat{\theta}, G) - \frac{\log N}{2} \dim(\hat{\theta}) \Big] && (26.37) \\
&\quad - \Big[ \log p(\overline{D} \mid \hat{\theta}, G) - \frac{\log N}{2} \dim(\hat{\theta}) \Big] && (26.38) \\
&= \log p(D \mid \hat{\theta}, G) - \log p(\overline{D} \mid \hat{\theta}, G) && (26.39)
\end{aligned}
$$

Putting it all together we get 

$$\log p(D|G) \approx \log p(\overline{D}|G) + \log p(D \mid \hat{\theta}, G) - \log p(\overline{D} \mid \hat{\theta}, G) \tag{26.40}$$

The first term $p(\overline{D}|G)$ can be computed by plugging in the filled-in data into the exact marginal 
likelihood. The second term $p(D \mid \hat{\theta}, G)$, which involves an exponential sum (thus matching the 
“dimensionality” of the left hand side) can be computed using an inference algorithm. The final 
term $p(\overline{D} \mid \hat{\theta}, G)$ can be computed by plugging in the filled-in data into the regular likelihood. 


Variational Bayes EM 


An even more accurate approach is to use the variational Bayes EM algorithm. Recall from 
Section 21.6 that the key idea is to make the following factorization assumption: 


$$p(\theta, z_{1:N} \mid D) \approx q(\theta)\, q(z) = q(\theta) \prod_i q(z_i) \tag{26.41}$$


where $z_i$ are the hidden variables in case $i$. In the E step, we update the $q(z_i)$, and in the 
M step, we update $q(\theta)$. The corresponding variational free energy provides a lower bound on 
the log marginal likelihood. In (Beal and Ghahramani 2006), it is shown that this bound is a 
much better approximation to the true log marginal likelihood (as estimated by a slow annealed 
importance sampling procedure) than either BIC or CS. In fact, one can prove that the variational 
bound will always be more accurate than CS (which in turn is always more accurate than BIC). 
Figure 26.9 The most probable DAG with a single binary hidden variable learned from the Sewell-Shah 
data. MAP estimates of the CPT entries are shown for some of the nodes. Source: (Heckerman et al. 1997). 
Used with kind permission of David Heckerman. 


Example: college plans revisited 


Let us revisit the college plans dataset from Section 26.4.2.5. Recall that if we ignore the 
possibility of hidden variables there was a direct link from socio economic status to IQ in the 
MAP DAG. Heckerman et al. decided to see what would happen if they introduced a hidden 
variable H, which they made a parent of both SES and IQ, representing a hidden common cause. 
They also considered a variant in which H points to SES, IQ and PE. For both such cases, they 
considered dropping none, one, or both of the SES-PE and PE-IQ edges. They varied the number 
of states for the hidden node from 2 to 6. Thus they computed the approximate posterior over 
$8 \times 5 = 40$ different models, using the CS approximation. 

The most probable model which they found is shown in Figure 26.9. This is $2 \cdot 10^{10}$ times 
more likely than the best model containing no hidden variable. It is also $5 \cdot 10^{9}$ times more 
likely than the second most probable model with a hidden variable. So again the posterior is 
very peaked. 

These results suggest that there is indeed a hidden common cause underlying both the 
socio-economic status of the parents and the IQ of the children. By examining the CPT entries, 
we see that both SES and IQ are more likely to be high when H takes on the value 1. They 
interpret this to mean that the hidden variable represents “parent quality” (possibly a genetic 
factor). Note, however, that the arc between H and SES can be reversed without changing the v- 
structures in the graph, and thus without affecting the likelihood; this underscores the difficulty 
in interpreting hidden variables. 

Interestingly, the hidden variable model has the same conditional independence assumptions 
amongst the visible variables as the most probable visible variable model. So it is not pos- 
sible to distinguish between these hypotheses by merely looking at the empirical conditional 
independencies in the data (which is the basis of the constraint-based approach to structure 
learning (Pearl and Verma 1991; Spirtes et al. 2000)). Instead, by adopting a Bayesian approach, 
which takes parsimony into account (and not just conditional independence), we can discover 
the possible existence of hidden factors. This is the basis of much of scientific and everyday 
human reasoning (see e.g. (Griffiths and Tenenbaum 2009) for a discussion). 


Structural EM 


One way to perform structural inference in the presence of missing data is to use a standard 
search procedure (deterministic or stochastic), and to use the methods from Section 26.5.1 to 
estimate the marginal likelihood. However, this approach is not very efficient, because the marginal 
likelihood does not decompose when we have missing data, and nor do its approximations. 
For example, if we use the CS approximation or the VBEM approximation, we have to perform 
inference in every neighboring model, just to evaluate the quality of a single move! 

(Friedman 1997b; Thiesson et al. 1998) present a much more efficient approach called the 
structural EM algorithm. The basic idea is this: instead of fitting each candidate neighboring 
graph and then filling in its data, fill in the data once, and use this filled-in data to evaluate 
the score of all the neighbors. Although this might be a bad approximation to the marginal 
likelihood, it can be a good enough approximation of the difference in marginal likelihoods 
between different models, which is all we need in order to pick the best neighbor. 

More precisely, define $\overline{D}(G_0, \hat{\theta}_0)$ to be the data filled in using model $G_0$ with MAP parameters 
$\hat{\theta}_0$. Now define a modified BIC score as follows: 

$$\mathrm{score}_{\mathrm{BIC}}(G, D) \triangleq \log p(D \mid \hat{\theta}, G) - \frac{\log N}{2} \dim(G) + \log p(G) + \log p(\hat{\theta}|G) \tag{26.42}$$

where we have included the log prior for the graph and parameters. One can show (Friedman 
1997b) that if we pick a graph $G$ which increases the BIC score relative to $G_0$ on the expected 
data, it will also increase the score on the actual data, i.e., 

$$\mathrm{score}_{\mathrm{BIC}}(G, \overline{D}(G_0, \hat{\theta}_0)) - \mathrm{score}_{\mathrm{BIC}}(G_0, \overline{D}(G_0, \hat{\theta}_0)) \leq \mathrm{score}_{\mathrm{BIC}}(G, D) - \mathrm{score}_{\mathrm{BIC}}(G_0, D) \tag{26.43}$$


To convert this into an algorithm, we proceed as follows. First we initialize with some graph 
$G_0$ and some set of parameters $\theta_0$. Then we fill in the data using the current parameters; in 
practice, this means when we ask for the expected counts for any particular family, we perform 
inference using our current model. (If we know which counts we will need, we can precompute 
all of them, which is much faster.) We then evaluate the BIC score of all of our neighbors using 
the filled-in data, and we pick the best neighbor. We then refit the model parameters, fill in the 
data again, and repeat. For increased speed, we may choose to only refit the model every few 
steps, since small changes to the structure hopefully won't invalidate the parameter estimates 
and the filled-in data too much. 
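
The control flow of this procedure can be sketched as a short Python skeleton. Everything model-specific is passed in as a callable, and all of these callables (`fill_in`, `neighbors`, `score`, `fit`) are hypothetical placeholders rather than parts of any particular library; the sketch only shows the order of operations.

```python
def structural_em(graph, params, fill_in, neighbors, score, fit, n_iters=10):
    """Generic structural EM loop. The caller supplies:
      fill_in(graph, params)  -> expected (filled-in) data or sufficient statistics
      neighbors(graph)        -> iterable of candidate graphs
      score(graph, filled)    -> BIC-style score evaluated on the filled-in data
      fit(graph, params)      -> refitted parameters (e.g., by running EM)
    """
    for _ in range(n_iters):
        filled = fill_in(graph, params)            # fill in once per iteration
        best = max(neighbors(graph), key=lambda g: score(g, filled), default=graph)
        if score(best, filled) <= score(graph, filled):
            break                                  # no neighbor improves the score
        graph = best
        params = fit(graph, params)                # refit, then fill in again
    return graph, params
```

The key point is that `fill_in` is called once per iteration, not once per candidate graph, so inference is not repeated for every neighbor.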

One interesting application is to learn a phylogenetic tree structure. Here the observed leaves 
are the DNA or protein sequences of currently alive species, and the goal is to infer the topology 
of the tree and the values of the missing internal nodes. There are many classical algorithms for 
this task (see e.g., (Durbin et al. 1998)), but one that uses SEM is discussed in (Friedman et al. 
2002). 

Another interesting application of this method is to learn sparse mixture models (Barash and 
Friedman 2002). The idea is that we have one hidden variable $C$ specifying the cluster, and we 
have to choose whether to add edges $C \rightarrow X_t$ for each possible feature $X_t$. Thus some features 
will be dependent on the cluster id, and some will be independent. (See also (Law et al. 2004) 
for a different way to perform this task, using regular EM and a set of bits, one per feature, that 
are free to change across data cases.) 

Figure 26.10 Part of a hierarchical latent tree learned from the 20-newsgroup data. From Figure 2 of 
(Harmeling and Williams 2011). Used with kind permission of Stefan Harmeling. 


Discovering hidden variables 


In Section 26.5.1.4, we introduced a hidden variable “by hand”, and then figured out the local 
topology by fitting a series of different models and computing the one with the best marginal 
likelihood. How can we automate this process? 

Figure 11.1 provides one useful intuition: if there is a hidden variable in the “true model”, 
then its children are likely to be densely connected. This suggests the following heuristic (Elidan 
et al. 2000): perform structure learning in the visible domain, and then look for structural 
signatures, such as sets of densely connected nodes (near-cliques); introduce a hidden variable 
and connect it to all nodes in this near-clique; and then let structural EM sort out the details. 
Unfortunately, this technique does not work too well, since structure learning algorithms are 
biased against fitting models with densely connected cliques. 

Another useful intuition comes from clustering. In a flat mixture model, also called a latent 
class model, the discrete latent variable provides a compressed representation of its children. 
Thus we want to create hidden variables with high mutual information with their children. 

One way to do this is to create a tree-structured hierarchy of latent variables, each of which 
only has to explain a small set of children. (Zhang 2004) calls this a hierarchical latent class 
model. They propose a greedy local search algorithm to learn such structures, based on adding 
or deleting hidden nodes, adding or deleting edges, etc. (Note that learning the optimal latent 
tree is NP-hard (Roch 2006).) 

Figure 26.11 A partially latent tree learned from the 20-newsgroup data. Note that some words can 
have multiple meanings, and get connected to different latent variables, representing different “topics”. For 
example, the word “win” can refer to a sports context (represented by h5) or the Microsoft Windows context 
(represented by h25). From Figure 12 of (Choi et al. 2011). Used with kind permission of Jin Choi. 

Recently (Harmeling and Williams 2011) proposed a faster greedy algorithm for learning such 
models based on agglomerative hierarchical clustering. Rather than go into details, we just give 
an example of what this system can learn. Figure 26.10 shows part of a latent forest learned 
from the 20-newsgroup data. The algorithm imposes the constraint that each latent node has 
exactly two children, for speed reasons. Nevertheless, we see interpretable clusters arising. For 
example, Figure 26.10 shows separate clusters concerning medicine, sports and religion. This 
provides an alternative to LDA and other topic models (Section 4.2.2), with the added advantage 
that inference in latent trees is exact and takes time linear in the number of nodes. 

An alternative approach is proposed in (Choi et al. 2011), in which the observed data is not 
constrained to be at the leaves. This method starts with the Chow-Liu tree on the observed 
data, and then adds hidden variables to capture higher-order dependencies between internal 
nodes. This results in much more compact models, as shown in Figure 26.11. This model also 
has better predictive accuracy than other approaches, such as mixture models, or trees where 
all the observed data is forced to be at the leaves. Interestingly, one can show that this method 
can recover the exact latent tree structure, provided the data is generated from a tree. See 
(Choi et al. 2011) for details. Note, however, that this approach, unlike (Zhang 2004; Harmeling 
and Williams 2011), requires that the cardinality of all the variables, hidden and observed, be 
the same. Furthermore, if the observed variables are Gaussian, the hidden variables must be 
Gaussian also. 

Figure 26.12 Google’s Rephil model. Leaves represent presence or absence of words. Internal nodes 
represent clusters of co-occurring words, or “concepts”. All nodes are binary, and all CPDs are noisy-OR. 
The model contains 12 million word nodes, 1 million latent cluster nodes, and 350 million edges. Used 
with kind permission of Brian Milch. 


Case study: Google’s Rephil 


In this section, we describe a huge DGM called Rephil, which was automatically learned from 
data.³ The model is widely used inside Google for various purposes, including their famous 
AdSense system.⁴ 

The model structure is shown in Figure 26.12. The leaves are binary nodes, and represent 
the presence or absence of words or compounds (such as “New York City”) in a text document 
or query. The latent variables are also binary, and represent clusters of co-occurring words. All 
CPDs are noisy-OR, since some leaf nodes (representing words) can have many parents. This 
means each edge can be augmented with a hidden variable specifying if the link was activated 
or not; if the link is not active, then the parent cannot turn the child on. (A very similar model 
was proposed independently in (Singliar and Hauskrecht 2006).) 

Parameter learning is based on EM, where the hidden activation status of each edge needs 
to be inferred (Meek and Heckerman 1997). Structure learning is based on the old neuroscience 
idea that “nodes that fire together should wire together”. To implement this, we run inference 
and check for cluster-word and cluster-cluster pairs that frequently turn on together. We then 
add an edge from parent to child if the link can significantly increase the probability of the 
child. Links that are not activated very often are pruned out. We initialize with one cluster per 
“document” (corresponding to a set of semantically related phrases). We then merge clusters A 
and B if A explains B’s top words and vice versa. We can also discard clusters that are used 
too rarely. 


3. The original system, called “Phil”, was developed by Georges Harik and Noam Shazeer. It has been published as US 
Patent #8024372, “Method and apparatus for learning a probabilistic generative model for text”, filed in 2004. Rephil is 
a more probabilistically sound version of the method, developed by Uri Lerner et al. The summary below is based on 
notes by Brian Milch (who also works at Google). 

4. AdSense is Google’s system for matching web pages with content-appropriate ads in an automatic way, by extracting 
semantic keywords from web pages. These keywords play a role analogous to the words that users type in when 
searching; this latter form of information is used by Google’s AdWords system. The details are secret, but (Levy 2011) 
gives an overview. 

The model was trained on about 100 billion text snippets or search queries; this takes several 
weeks, even on a parallel distributed computing architecture. The resulting model contains 12 
million word nodes and about 1 million latent cluster nodes. There are about 350 million links 
in the model, including many cluster-cluster dependencies. The longest path in the graph has 
length 555, so the model is quite deep. 

Exact inference in this model is obviously infeasible. However, note that most leaves will be 
off, since most words do not occur in a given query; such leaves can be analytically removed, as 
shown in Exercise 10.7. We can also prune out unlikely hidden nodes by following the strongest 
links from the words that are on up to their parents to get a candidate set of concepts. We 
then perform iterative conditional modes (ICM) to find a good set of local maxima. At each step of 
ICM, each node sets its value to its most probable state given the values of its neighbors in its 
Markov blanket. This continues until it reaches a local maximum. We can repeat this process 
a few times from random starting configurations. At Google, this can be made to run in 15 
milliseconds! 
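
The ICM update itself is generic and can be sketched in a few lines of Python. This is not Rephil's implementation: the model is abstracted into a hypothetical `local_log_prob(node, value, state)` callable that returns an unnormalized log conditional of a binary node given the current values of the others.

```python
def icm(state, local_log_prob, max_sweeps=50):
    """Iterative conditional modes for binary nodes.

    state: dict mapping node -> 0 or 1 (the starting configuration).
    local_log_prob(node, value, state): unnormalized log conditional of
    `node` taking `value` given the other entries of `state`."""
    state = dict(state)
    for _ in range(max_sweeps):
        changed = False
        for node in state:
            best = max((0, 1), key=lambda v: local_log_prob(node, v, state))
            if best != state[node]:
                state[node] = best
                changed = True
        if not changed:
            break                      # local maximum: no single flip helps
    return state
```

Running `icm` from several random starting configurations and keeping the best result corresponds to the restarts mentioned in the text.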


Structural equation models * 


A structural equation model (Bollen 1989) is a special kind of directed mixed graph (Sec- 
tion 19.4.4.1), possibly cyclic, in which all CPDs are linear Gaussian, and in which all bidirected 
edges represent correlated Gaussian noise. Such models are also called path diagrams. SEMs 
are widely used, especially in economics and social science. It is common to interpret the edge 
directions in terms of causality, where directed cycles are interpreted is in terms of feedback 
loops (see e.g., (Pearl 2000, Ch.5)). However, the model is really just a way of specifying a joint 
Gaussian, as we show below. There is nothing inherently “causal” about it at all. (We discuss 
causality in Section 26.6.) 
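We can define an SEM as a series of full conditionals as follows:

$$x_i = \mu_i + \sum_{j \neq i} w_{ij} x_j + \epsilon_i \tag{26.44}$$

where $\epsilon \sim \mathcal{N}(0, \Psi)$. We can rewrite the model in matrix form as follows:

$$x = Wx + \mu + \epsilon \;\Rightarrow\; x = (I - W)^{-1}(\epsilon + \mu) \tag{26.45}$$

Hence the joint distribution is given by $p(x) = \mathcal{N}((I - W)^{-1}\mu, \Sigma)$ where

$$\Sigma = (I - W)^{-1} \Psi (I - W)^{-T} \tag{26.46}$$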
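As a sanity check on Equation 26.46, the following sketch builds the implied Gaussian for a small SEM and compares it with samples. The three-variable chain, its weights and the noise variances are made-up values chosen for illustration; the mean follows from taking expectations in Equation 26.45.

```python
import numpy as np

# Hypothetical recursive SEM x1 -> x2 -> x3 (all numbers are illustrative).
W = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, -0.5, 0.0]])
mu = np.array([1.0, 0.0, 2.0])
Psi = np.diag([1.0, 0.5, 0.2])     # diagonal noise, so no bidirected edges

A = np.linalg.inv(np.eye(3) - W)   # (I - W)^{-1}
mean = A @ mu                      # E[x] = (I - W)^{-1} mu
Sigma = A @ Psi @ A.T              # Equation 26.46

# Monte Carlo check: draw the noise, then solve x = (I - W)^{-1} (eps + mu).
rng = np.random.default_rng(0)
eps = rng.multivariate_normal(np.zeros(3), Psi, size=200_000)
X = (eps + mu) @ A.T
```

Because W is lower triangular and Psi is diagonal here, the model is a recursive SEM, i.e., an ordinary Gaussian DGM.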


We draw an arc $X_i \leftarrow X_j$ if $|w_{ij}| > 0$. If $W$ is lower triangular then the graph is acyclic. If,
in addition, $\Psi$ is diagonal, then the model is equivalent to a Gaussian DGM, as discussed in
Section 10.2.5; such models are called recursive. If $\Psi$ is not diagonal, then we draw a bidirected
arc $X_i \leftrightarrow X_j$ for each non-zero off-diagonal term. Such edges represent correlation, possibly
due to a hidden common cause.

Figure 26.13 A cyclic directed mixed graphical model (non-recursive SEM). Note the $Z_1 \to Z_2 \to Z_3 \to Z_1$
feedback loop.

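When using structural equation models, it is common to partition the variables into latent
variables, $Z_i$, and observed or manifest variables $Y_i$. For example, Figure 26.13 illustrates the
following model:

$$
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \\ X_5 \\ X_6 \end{pmatrix}
=
\begin{pmatrix} Z_1 \\ Z_2 \\ Z_3 \\ Y_1 \\ Y_2 \\ Y_3 \end{pmatrix}
=
\begin{pmatrix}
0 & 0 & w_{13} & 0 & 0 & 0 \\
w_{21} & 0 & 0 & 0 & 0 & 0 \\
0 & w_{32} & 0 & 0 & 0 & 0 \\
w_{41} & 0 & 0 & 0 & 0 & 0 \\
0 & w_{52} & 0 & 0 & 0 & 0 \\
0 & 0 & w_{63} & 0 & 0 & 0
\end{pmatrix}
\begin{pmatrix} Z_1 \\ Z_2 \\ Z_3 \\ Y_1 \\ Y_2 \\ Y_3 \end{pmatrix}
+
\begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \epsilon_6 \end{pmatrix}
\tag{26.47}
$$

where

$$
\Psi =
\begin{pmatrix}
\Psi_{11} & 0 & 0 & 0 & 0 & 0 \\
0 & \Psi_{22} & 0 & 0 & 0 & 0 \\
0 & 0 & \Psi_{33} & 0 & 0 & 0 \\
0 & 0 & 0 & \Psi_{44} & \Psi_{45} & 0 \\
0 & 0 & 0 & \Psi_{54} & \Psi_{55} & 0 \\
0 & 0 & 0 & 0 & 0 & \Psi_{66}
\end{pmatrix}
\tag{26.48}
$$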


The presence of a feedback loop $Z_1 \to Z_2 \to Z_3$ is evident from the fact that $W$ is not lower
triangular. Also the presence of confounding between $Y_1$ and $Y_2$ is evident in the off-diagonal
terms in $\Psi$.

Often we assume there are multiple observations for each latent variable. To ensure identifiability,
we can set the mean of the latent variables $Z_i$ to 0, and we can set the regression weights
of $Z_i \to Y_i$ to 1. This essentially defines the scale of each latent variable. (In addition to the
Z’s, there are the extra hidden variables implied by the presence of the bidirected edges.)

The standard practice in the SEM community, as exemplified by the popular commercial
software package called LISREL (available from http://www.ssicentral.com/lisrel1/), is to
build the structure by hand, to estimate the parameters by maximum likelihood, and then to
test if any of the regression weights are significantly different from 0, using standard frequentist
methods. However, one can also use Bayesian inference for the parameters (see e.g., (Dunson
et al. 2005)). Structure learning in SEMs is rare, but since recursive SEMs are equivalent to
Gaussian DAGs, many of the techniques we have been discussing in this section can be applied.

SEMs are closely related to factor analysis (FA) models (Chapter 12). The basic difference is 
that in an FA model, the latent Gaussian has a low-rank covariance matrix, and the observed 
noise has a diagonal covariance (hence no bidirected edges). In an SEM, the covariance of the 
latent Gaussian has a sparse Cholesky decomposition (at least if W is acyclic), and the observed 
noise might have a full covariance matrix. 

Note that SEMs can be extended in many ways. For example, we can add covariates/input
variables (possibly noisily observed), we can make some of the observations be discrete (e.g., by 
using probit links), and so on. 


Learning causal DAGs 


Causal models are models which can predict the effects of interventions to, or manipulations 
of, a system. For example, an electronic circuit diagram implicitly provides a compact encoding 
of what will happen if one removes any given component, or cuts any wire. A causal medical 
model might predict that if I continue to smoke, I am likely to get lung cancer (and hence if 
I cease smoking, I am less likely to get lung cancer). Causal claims are inherently stronger, 
yet more useful, than purely associative claims, such as “people who smoke often have lung 
cancer”. 

Causal models are often represented by DAGs (Pearl 2000), although this is somewhat contro- 
versial (Dawid 2010). We explain this causal interpretation of DAGs below. We then show how 
to use a DAG to do causal reasoning. Finally, we briefly discuss how to learn the structure of 
causal DAGs. A more detailed description of this topic can be found in (Pearl 2000) and (Koller 
and Friedman 2009, Ch.21). 


Causal interpretation of DAGs 


In this section, we define a directed edge $A \to B$ in a DAG to mean that “A directly causes B”,
so if we manipulate A, then B will change. This is known as the causal Markov assumption.
(Of course, we have not defined the word “causes”, and we cannot do that by appealing to a
DAG, lest we end up with a cyclic definition; see (Dawid 2010) for further discussion of this point.)

We will also assume that all relevant variables are included in the model, i.e., there are no 
unknown confounders, reflecting hidden common causes. This is called the causal sufficiency 
assumption. (If there are known to be confounders, they should be added to the model, although 
one can sometimes use mixed directed graphs (Section 26.5.5) as a way to avoid having to model 
confounders explicitly.) 

Assuming we are willing to make the causal Markov and causal sufficiency assumptions, we
can use DAGs to answer causal questions. The key abstraction is that of a perfect intervention;
this represents the act of setting a variable to some known value, say setting $X_i$ to $x_i$. A real
world example of such a perfect intervention is a gene knockout experiment, in which a gene
is “silenced”. We need some notational convention to distinguish this from observing that $X_i$
happens to have value $x_i$. We use Pearl’s do calculus notation (as in the verb “to do”) and write
$do(X_i = x_i)$ to denote the event that we set $X_i$ to $x_i$. A causal model can be used to make
inferences of the form $p(x|do(X_i = x_i))$, which is different from making inferences of the form
$p(x|X_i = x_i)$.

Figure 26.14 Surgical intervention on X. Based on (Pe'er 2005).

To understand the difference between conditioning on interventions and conditioning on
observations (i.e., the difference between doing and seeing), consider a 2 node DGM $S \to Y$, in
which S = 1 if you smoke and S = 0 otherwise, and Y = 1 if you have yellow-stained fingers,
and Y = 0 otherwise. If I observe you have yellow fingers, I am licensed to infer that you are
probably a smoker (since nicotine causes yellow stains):

$$p(S = 1|Y = 1) > p(S = 1) \tag{26.49}$$

However, if I intervene and paint your fingers yellow, I am no longer licensed to infer this, since
I have disrupted the normal causal mechanism. Thus

$$p(S = 1|do(Y = 1)) = p(S = 1) \tag{26.50}$$


One way to model perfect interventions is to use graph surgery: represent the joint distri- 
bution by a DGM, and then cut the arcs coming into any nodes that were set by intervention. 
See Figure 26.14 for an example. This prevents any information flow from the nodes that were 
intervened on from being sent back up to their parents. Having performed this surgery, we can
then perform probabilistic inference in the resulting “mutilated” graph in the usual way to reason 
about the effects of interventions. We state this formally as follows. 
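The difference between seeing and doing in the smoking example can be checked numerically. The sketch below uses made-up CPD values for the two-node model S -> Y; the function name `posterior_s1` is hypothetical. Graph surgery is mimicked by replacing p(Y=1|S) with a CPD that ignores S, which is what cutting the arc into Y amounts to.

```python
# Hypothetical prior on smoking (numbers are illustrative only).
p_s = {1: 0.3, 0: 0.7}


def posterior_s1(p_y1_given_s):
    """p(S=1 | Y=1) by Bayes' rule, for a given CPD p(Y=1 | S)."""
    num = p_y1_given_s[1] * p_s[1]
    return num / (num + p_y1_given_s[0] * p_s[0])


# Seeing: inference in the original graph, where smoking makes stains likely.
seeing = posterior_s1({1: 0.8, 0: 0.05})

# Doing: in the mutilated graph the arc S -> Y is cut and Y is clamped to 1,
# so Y = 1 is equally likely whatever the value of S.
doing = posterior_s1({1: 1.0, 0: 1.0})
```

Here `seeing` is well above the prior, as in Equation 26.49, while `doing` equals the prior, as in Equation 26.50.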


Theorem 26.6.1 (Manipulation theorem (Pearl 2000; Spirtes et al. 2000)). To compute $p(X_i|do(X_j))$
for sets of nodes i, j, we can perform surgical intervention on the $X_j$ nodes and then use standard
probabilistic inference in the mutilated graph.


We can generalize the notion of a perfect intervention by adding interventions as explicit
action nodes to the graph. The result is like an influence diagram, except there are no utility
nodes (Lauritzen 2000; Dawid 2002). This has been called the augmented DAG (Pearl 2000). We
can then define the CPD $p(X_i|do(X_i))$ to be anything we want. We can also allow an action to
affect multiple nodes. This is called a fat hand intervention, a reference to someone trying to
change a single component of some system (e.g., an electronic circuit), but accidentally touching
multiple components and thereby causing various side effects (see (Eaton and Murphy 2007) for
a way to model this using augmented DAGs).

Figure 26.15 Illustration of Simpson's paradox. Figure generated by simpsonsParadoxGraph.


Using causal DAGs to resolve Simpson’s paradox 


In this section, we assume we know the causal DAG. We can then do causal reasoning by 
applying d-separation to the mutilated graph. In this section, we give an example of this, and 
show how causal reasoning can help resolve a famous paradox, known as Simpson’s paradox.

Simpson’s paradox says that any statistical relationship between two variables can be reversed
by including additional factors in the analysis. For example, suppose some cause C (say, taking
a drug) makes some effect E (say getting better) more likely

$$P(E|C) > P(E|\neg C)$$

and yet, when we condition on the gender of the patient, we find that taking the drug makes
the effect less likely in both females (F) and males ($\neg F$):

$$P(E|C, F) < P(E|\neg C, F)$$
$$P(E|C, \neg F) < P(E|\neg C, \neg F)$$

This seems impossible, but by the rules of probability, this is perfectly possible, because the
event space where we condition on $(\neg C, F)$ or $(\neg C, \neg F)$ can be completely different to the
event space when we just condition on $\neg C$. The table of numbers below shows a concrete
example (from (Pearl 2000, p175)):


         Combined                 | Male                     | Female
         E    ¬E   Total  Rate    | E    ¬E   Total  Rate    | E    ¬E   Total  Rate
C        20   20   40     50%     | 18   12   30     60%     | 2    8    10     20%
¬C       16   24   40     40%     | 7    3    10     70%     | 9    21   30     30%
Total    36   44   80             | 25   15   40             | 11   29   40

From this table of numbers, we see that

$$p(E|C) = 20/40 = 0.5 > p(E|\neg C) = 16/40 = 0.4 \tag{26.51}$$
$$p(E|C, F) = 2/10 = 0.2 < p(E|\neg C, F) = 9/30 = 0.3 \tag{26.52}$$
$$p(E|C, \neg F) = 18/30 = 0.6 < p(E|\neg C, \neg F) = 7/10 = 0.7 \tag{26.53}$$
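The reversal can be reproduced directly from the counts. The following sketch recomputes the pooled and per-gender recovery rates; the counts come from the table, while the dictionary layout and the names `rate`, `pooled`, `by_group` and `combined` are choices made for this illustration.

```python
# counts[group][treatment] = (recovered, not recovered), from the table.
counts = {
    "male":   {"C": (18, 12), "notC": (7, 3)},
    "female": {"C": (2, 8),   "notC": (9, 21)},
}


def rate(recovered, not_recovered):
    return recovered / (recovered + not_recovered)


def pooled(treatment):
    """Recovery rate after merging the two gender groups."""
    rec = sum(counts[g][treatment][0] for g in counts)
    non = sum(counts[g][treatment][1] for g in counts)
    return rate(rec, non)


by_group = {g: {t: rate(*counts[g][t]) for t in ("C", "notC")} for g in counts}
combined = {t: pooled(t) for t in ("C", "notC")}
```

The pooled rates favor the drug only because the treated group is mostly male (30 of 40) and males recover more often in this data, with or without the drug.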


A visual representation of the paradox is given in Figure 26.15. The line which goes up and
to the right shows that the effect (y-axis) increases as the cause (x-axis) increases. However, the 
dots represent the data for females, and the crosses represent the data for males. Within each 
subgroup, we see that the effect decreases as we increase the cause. 

It is clear that the effect is real, but it is still very counter-intuitive. The reason the paradox
arises is that we are interpreting the statements causally, but we are not using proper causal
reasoning when performing our calculations. The statement that the drug C causes recovery E
is

$$P(E|do(C)) > P(E|do(\neg C)) \tag{26.54}$$

whereas the data merely tell us

$$P(E|C) > P(E|\neg C) \tag{26.55}$$

This is not a contradiction. Observing C is positive evidence for E, since more males than
females take the drug, and the male recovery rate is higher (regardless of the drug). Thus
Equation 26.55 does not imply Equation 26.54.

Nevertheless, we are left with a practical question: should we use the drug or not? It seems 
like if we don’t know the patient’s gender, we should use the drug, but as soon as we discover 
if they are male or female, we should stop using it. Obviously this conclusion is ridiculous. 

To answer the question, we need to make our assumptions more explicit. Suppose reality can 
be modeled by the causal DAG in Figure 26.16(a). To compute the causal effect of C on E, we 
need to adjust for (i.e., condition on) the confounding variable F. This is necessary because 
there is a backdoor path from C to E via F, so we need to check the $C \to E$ relationship for
each value of F separately, to make sure the relationship between C and E is not affected by 
any value of F. 

Suppose that for each value of F, taking the drug is harmful, that is,

$$p(E|do(C), F) < p(E|do(\neg C), F) \tag{26.56}$$
$$p(E|do(C), \neg F) < p(E|do(\neg C), \neg F) \tag{26.57}$$

Then we can show that taking the drug is harmful overall:

$$p(E|do(C)) < p(E|do(\neg C)) \tag{26.58}$$


Figure 26.16 Two different models used to illustrate Simpson's paradox. Both have nodes C (treatment),
F and E (recovery). (a) F is gender and is a confounder for C and E. (b) F is blood pressure and is
caused by C.

The proof is as follows (Pearl 2000, p181). First, from our assumptions in Figure 26.16(a), we see
that drugs have no effect on gender

$$p(F|do(C)) = p(F|do(\neg C)) = p(F) \tag{26.59}$$

Now using the law of total probability,

$$p(E|do(C)) = p(E|do(C), F)\,p(F|do(C)) + p(E|do(C), \neg F)\,p(\neg F|do(C)) \tag{26.60}$$
$$\phantom{p(E|do(C))} = p(E|do(C), F)\,p(F) + p(E|do(C), \neg F)\,p(\neg F) \tag{26.61}$$

Similarly,

$$p(E|do(\neg C)) = p(E|do(\neg C), F)\,p(F) + p(E|do(\neg C), \neg F)\,p(\neg F) \tag{26.62}$$

Since every term in Equation 26.61 is less than the corresponding term in Equation 26.62, we
conclude that

$$p(E|do(C)) < p(E|do(\neg C)) \tag{26.63}$$


So if the model in Figure 26.16(a) is correct, we should not administer the drug, since it reduces 
the probability of the effect. 
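To make this concrete, we can plug the numbers from the table into Equations 26.61 and 26.62. The sketch below assumes the model in Figure 26.16(a), under which conditioning on F blocks the backdoor path, so that p(E|do(c), f) can be replaced by the observed rate p(E|c, f); the names `p_e`, `p_f` and `p_e_do` are hypothetical.

```python
# Observed stratum-specific recovery rates p(E | C, F) and the gender
# marginal p(F), read off the table (F = female, 40 of 80 patients).
p_e = {("C", "F"): 0.2, ("C", "notF"): 0.6,
       ("notC", "F"): 0.3, ("notC", "notF"): 0.7}
p_f = {"F": 40 / 80, "notF": 40 / 80}


def p_e_do(treatment):
    # Equations 26.61 and 26.62, assuming p(E | do(c), f) = p(E | c, f),
    # which holds when F is the only confounder, as in Figure 26.16(a).
    return sum(p_e[(treatment, f)] * p_f[f] for f in p_f)


effect = p_e_do("C") - p_e_do("notC")
```

Under these assumptions p(E|do(C)) is 0.4 and p(E|do(not C)) is 0.5, so the adjusted effect is negative, in agreement with Equation 26.63 and in contrast to the unadjusted comparison 0.5 > 0.4.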

Now consider a different version of this example. Suppose we keep the data the same but
interpret F as something that is affected by C, such as blood pressure. See Figure 26.16(b). In
this case, we can no longer assume

$$p(F|do(C)) = p(F|do(\neg C)) = p(F) \tag{26.64}$$

and the above proof breaks down. So $p(E|do(C)) - p(E|do(\neg C))$ may be positive or negative.

If the true model is Figure 26.16(b), then we should not condition on F when assessing the
effect of C on E, since there is no backdoor path in this case, because of the v-structure at
F. That is, conditioning on F might block one of the causal pathways. In other words, by
comparing patients with the same post-treatment blood pressure (value of F), we may mask the
effect of one of the two pathways by which the drug operates to bring about recovery.

Thus we see that different causal assumptions lead to different causal conclusions, and hence
different courses of action. This raises the question of whether we can learn the causal model
from data. We discuss this issue below.


Learning causal DAG structures 


In this section, we discuss some ways to learn causal DAG structures. 
Learning from observational data


In Section 26.4, we discussed various methods for learning DAG structures from observational 
data. It is natural to ask whether these methods can recover the “true” DAG structure that was 
used to generate the data. Clearly, even if we have infinite data, an optimal method can only 
identify the DAG up to Markov equivalence (Section 26.4.1). That is, it can identify the PDAG 
(partially directed acyclic graph), but not the complete DAG structure, because all DAGs which are
Markov equivalent have the same likelihood. 

There are several algorithms (e.g., the greedy equivalence search method of (Chickering 
2002)) that are consistent estimators of PDAG structure, in the sense that they identify the 
true Markov equivalence class as the sample size goes to infinity, assuming we observe all the 
variables. However, we also have to assume that the generating distribution p is faithful to 
the generating DAG G. This means that all the conditional independence (CI) properties of p are
exactly captured by the graphical structure, so I(p) = I(G); this means there cannot be any CI 
properties in p that are due to particular settings of the parameters (such as zeros in a regression 
matrix) that are not graphically explicit. For this reason, a faithful distribution is also called a 
stable distribution. 
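A standard way to see how faithfulness can fail is a linear Gaussian model in which two paths cancel. The following worked example is an illustration added here, with hypothetical weights a, b, c; it is not taken from the references above.

```latex
% DAG: X -> Y, Y -> Z, X -> Z, with independent zero-mean Gaussian noise.
\begin{align*}
Y &= aX + \epsilon_Y, \qquad Z = bY + cX + \epsilon_Z \\
\mathrm{Cov}(X, Z) &= b\,\mathrm{Cov}(X, Y) + c\,\mathrm{Var}(X)
                    = (ab + c)\,\mathrm{Var}(X)
\end{align*}
% If c = -ab, then Cov(X, Z) = 0, so X and Z are marginally independent
% (the variables are jointly Gaussian), even though the graph contains the
% edge X -> Z and does not imply this independence.
```

The independence holds only on the measure-zero set of parameters with c = -ab, which is why such distributions are regarded as unstable: an arbitrarily small change to the weights destroys the extra CI property.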

Suppose the assumptions hold and we learn a PDAG. What can we do with it? Instead of 
recovering the full graph, we can focus on the causal analog of edge marginals, by computing 
the magnitude of the causal effect of one node on another (say A on B). If we know the DAG, we 
can do this using techniques described in (Pearl 2000). If the DAG is unknown, we can compute 
a lower bound on the effect as follows (Maathuis et al. 2009): learn an equivalence class (PDAG) 
from data; enumerate all the DAGs in the equivalence class; apply Pearl’s do-calculus to compute 
the magnitude of the causal effect of A on B in each DAG; finally, take the minimum of these 
effects as the lower bound. It is usually computationally infeasible to compute all DAGs in the 
equivalence class, but fortunately one only needs to be able to identify the local neighborhood 
of A and B, which can be estimated more efficiently, as described in (Maathuis et al. 2009). This
technique is called IDA, which is short for “intervention-calculus when the DAG is absent”. 
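The enumeration step can be sketched for the linear Gaussian case. This is a simplified illustration under stated assumptions, not the implementation of (Maathuis et al. 2009): it assumes the causal effect of A on B in a given DAG is the coefficient of A when B is regressed on A and the parents of A (and is zero if B is itself a parent of A), it takes the minimum absolute value over candidate parent sets, and the function names and the toy covariance are made up.

```python
import numpy as np


def effect_given_parents(cov, a, b, parents):
    """Effect of a on b in a linear Gaussian DAG in which pa(a) = parents."""
    if b in parents:  # b is a cause of a, so intervening on a cannot move b
        return 0.0
    idx = [a] + list(parents)
    # Regress b on {a} + pa(a); the coefficient on a is the causal effect.
    beta = np.linalg.solve(cov[np.ix_(idx, idx)], cov[idx, b])
    return float(beta[0])


def ida_lower_bound(cov, a, b, candidate_parent_sets):
    effects = [effect_given_parents(cov, a, b, ps) for ps in candidate_parent_sets]
    return min(abs(e) for e in effects), effects


# Covariance implied by C -> A -> B with weights 1.0 and 0.5 and unit noise
# (illustrative). Variable order: C = 0, A = 1, B = 2.
cov = np.array([[1.0, 1.0, 0.5],
                [1.0, 2.0, 1.0],
                [0.5, 1.0, 1.5]])

# The PDAG C - A - B contains C->A->B, C<-A->B and C<-A<-B, so the possible
# parent sets of A are {C}, {} and {B}.
bound, effects = ida_lower_bound(cov, a=1, b=2,
                                 candidate_parent_sets=[[0], [], [2]])
```

Only the possible parent sets of A are needed, which reflects the remark above that the local neighborhood suffices. In this example two of the three DAGs give an effect of 0.5 and the third gives 0, so the lower bound is 0.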

In (Maathuis et al. 2010), this technique was applied to some yeast gene expression data. Gene 
knockout data was used to estimate the “ground truth” effect of each of 234 single-gene deletions
on the remaining 5,361 genes. Then the algorithm was applied to 63 unperturbed (wild-type) 
samples, and was used to rank order the likely targets of each of the 234 genes. The method 
had a precision of 66% when the recall was set to 10%; while low, this is substantially more than 
rival variable-selection methods, such as lasso and elastic net, which were only slightly above 
chance. 


Learning from interventional data 


If we want to distinguish between DAGs within the equivalence class, we need to use interven- 
tional data, where certain variables have been set, and the consequences have been measured. 
An example of this is the dataset in Figure 26.17(a), where proteins in a signalling pathway 
were perturbed, and their phosphorylation status was measured using a technique called flow 
cytometry (Sachs et al. 2005). 

Figure 26.17 (a) A design matrix consisting of 5400 data points (rows) measuring the status (using flow
cytometry) of 11 proteins (columns) under different experimental conditions. The data has been discretized
into 3 states: low (black), medium (grey) and high (white). Some proteins were explicitly controlled using
activating or inhibiting chemicals. (b) A directed graphical model representing dependencies between
various proteins (blue circles) and various experimental interventions (pink ovals), which was inferred from
this data. We plot all edges for which $p(G_{st} = 1|D) > 0.5$. Dotted edges are believed to exist in nature
but were not discovered by the algorithm (1 false negative). Solid edges are true positives. The light colored
edges represent the effects of intervention. Source: Figure 6d of (Eaton and Murphy 2007). This figure can
be reproduced using the code at http://www.cs.ubc.ca/~murphyk/Software/BDAGL/index.html.

It is straightforward to modify the standard Bayesian scoring criteria, such as the marginal
likelihood or BIC score, to handle learning from mixed observational and experimental data: we
just compute the sufficient statistics for a CPD’s parameter by skipping over the cases where that
node was set by intervention (Cooper and Yoo 1999). For example, when using tabular CPDs, we
modify the counts as follows:

$$N_{tck} \triangleq \sum_{i:\, x_{it} \text{ not set}} \mathbb{I}(x_{i,t} = k,\; x_{i,\mathrm{pa}(t)} = c) \tag{26.65}$$


The justification for this is that in cases where node t is set by force, it is not sampled from its
usual mechanism, so such cases should be ignored when inferring the parameter $\theta_t$. The modified
scoring criterion can be combined with any of the standard structure learning algorithms.
(He and Geng 2009) discusses some methods for choosing which interventions to perform, so 
as to reduce the posterior uncertainty as quickly as possible (a form of active learning). 
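The modified counts of Equation 26.65 are simple to compute. The sketch below is a minimal illustration: the data layout (one dict per case plus a set of clamped nodes per case), the function name `interventional_counts` and the toy data are all assumptions made here.

```python
from collections import Counter


def interventional_counts(data, intervened, t, parents):
    """Counts N_tck for node t, skipping cases where t was set by intervention.

    data: list of dicts mapping node name -> discrete value
    intervened: list of sets; intervened[i] holds the nodes clamped in case i
    Returns a Counter keyed by (parent configuration c, child value k).
    """
    counts = Counter()
    for row, clamped in zip(data, intervened):
        if t in clamped:
            continue  # x_t was forced, so it says nothing about theta_t
        c = tuple(row[p] for p in parents)
        counts[(c, row[t])] += 1
    return counts


# Toy data (made up) for A -> B, with B clamped to 1 in the last two cases.
data = [{"A": 0, "B": 0}, {"A": 1, "B": 1}, {"A": 1, "B": 1},
        {"A": 0, "B": 1}, {"A": 1, "B": 1}]
intervened = [set(), set(), set(), {"B"}, {"B"}]

n_b = interventional_counts(data, intervened, t="B", parents=["A"])
n_a = interventional_counts(data, intervened, t="A", parents=[])
```

Note that a case in which B was clamped is dropped only from the counts for B; it still contributes to the counts for A, whose mechanism was left intact.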

The preceding method assumes the interventions are perfect. In reality, experimenters can
rarely control the state of individual molecules. Instead, they inject various stimulant or inhibitor 
chemicals which are designed to target specific molecules, but which may have side effects. We 
can model this quite simply by adding the intervention nodes to the DAG, and then learning 
a larger augmented DAG structure, with the constraint that there are no edges between the 
intervention nodes, and no edges from the “regular” nodes back to the intervention nodes. 

Figure 26.17(b) shows the augmented DAG that was learned from the interventional flow 
cytometry data depicted in Figure 26.17(a). In particular, we plot the median graph, which 
includes all edges for which $p(G_{ij} = 1|D) > 0.5$. These were computed using the exact
algorithm of (Koivisto 2006). It turns out that, in this example, the median model has exactly
the same structure as the optimal MAP model, $\arg\max_G p(G|D)$, which was computed using
the algorithm of (Koivisto and Sood 2004; Silander and Myllmaki 2006).


Learning undirected Gaussian graphical models 


Learning the structure of undirected graphical models is easier than learning DAG structure
because we don't need to worry about acyclicity. On the other hand, it is harder than learning 
DAG structure since the likelihood does not decompose (see Section 19.5). This precludes the 
kind of local search methods (both greedy search and MCMC sampling) we used to learn DAG 
structures, because the cost of evaluating each neighboring graph is too high, since we have to 
refit each model from scratch (there is no way to incrementally update the score of a model). 

In this section, we discuss several solutions to this problem, in the context of Gaussian 
random fields or undirected Gaussian graphical models (GGMs). We consider structure learning
for discrete undirected models in Section 26.8. 


MLE for a GGM 


Before discussing structure learning, we need to discuss parameter estimation. The task of 
computing the MLE for a (non-decomposable) GGM is called covariance selection (Dempster 
1972). 

From Equation 4.19, the log likelihood can be written as

$$\ell(\Omega) = \log\det\Omega - \mathrm{tr}(S\Omega) \tag{26.66}$$

where $\Omega = \Sigma^{-1}$ is the precision matrix, and $S = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T$ is the empirical
covariance matrix. (For notational simplicity, we assume we have already estimated $\hat{\mu} = \bar{x}$.)
One can show that the gradient of this is given by

$$\nabla \ell(\Omega) = \Omega^{-1} - S \tag{26.67}$$

However, we have to enforce the constraints that $\Omega_{st} = 0$ if $G_{st} = 0$ (structural zeros), and
that $\Omega$ is positive definite. The former constraint is easy to enforce, but the latter is somewhat
challenging (albeit still a convex constraint). One approach is to add a penalty term to the
objective if $\Omega$ leaves the positive definite cone; this is the approach used in ggmFitMinfunc
(see also (Dahl et al. 2008)). Another approach is to use a coordinate descent method, described
in (Hastie et al. 2009, p633), and implemented in ggmFitHtf. Yet another approach is to use
iterative proportional fitting, described in Section 19.5.7. However, IPF requires identifying the
cliques of the graph, which is NP-hard in general.

Interestingly, one can show that the MLE must satisfy the following property: $\Sigma_{st} = S_{st}$ if
$G_{st} = 1$ or $s = t$, i.e., the covariance of a pair that are connected by an edge must match the
empirical covariance. In addition, we have $\Omega_{st} = 0$ if $G_{st} = 0$, by definition of a GGM, i.e.,
the precision of a pair that are not connected must be 0. We say that $\Sigma$ is a positive definite
matrix completion of S, since it retains as many of the entries in S as possible, corresponding
to the edges in the graph, subject to the required sparsity pattern on $\Sigma^{-1}$, corresponding to the
absent edges; the remaining entries in $\Sigma$ are filled in so as to maximize the likelihood.

Let us consider a worked example from (Hastie et al. 2009, p652). We will use the following
adjacency matrix, representing the cyclic structure, $X_1 - X_2 - X_3 - X_4 - X_1$, and the following
empirical covariance matrix:

$$
G = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix},
\quad
S = \begin{pmatrix} 10 & 1 & 5 & 4 \\ 1 & 10 & 2 & 6 \\ 5 & 2 & 10 & 3 \\ 4 & 6 & 3 & 10 \end{pmatrix}
\tag{26.68}
$$

The MLE is given by

$$
\Sigma = \begin{pmatrix} 10.00 & 1.00 & 1.31 & 4.00 \\ 1.00 & 10.00 & 2.00 & 0.87 \\ 1.31 & 2.00 & 10.00 & 3.00 \\ 4.00 & 0.87 & 3.00 & 10.00 \end{pmatrix},
\quad
\Omega = \begin{pmatrix} 0.12 & -0.01 & 0 & -0.05 \\ -0.01 & 0.11 & -0.02 & 0 \\ 0 & -0.02 & 0.11 & -0.03 \\ -0.05 & 0 & -0.03 & 0.13 \end{pmatrix}
\tag{26.69}
$$

(See ggmFitDemo for the code to reproduce these numbers.) The constrained elements in $\Omega$ (the zeros)
and the free elements in $\Sigma$ (1.31 and 0.87) both correspond to the absent edges; they are the (1,3) and
(2,4) entries and their transposes.
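The worked example can be reproduced with a few lines of NumPy. The sketch below follows the idea of the coordinate descent method mentioned above (cycling over nodes and regressing each node on its graph neighbours only); it is an independent illustration, not the ggmFitHtf code, and the function name `ggm_mle` and the fixed number of sweeps are assumptions.

```python
import numpy as np


def ggm_mle(S, G, n_sweeps=50):
    """Fit a GGM with known structure G by cycling over the nodes."""
    p = S.shape[0]
    Sigma = S.astype(float).copy()
    for _ in range(n_sweeps):
        for j in range(p):
            rest = [k for k in range(p) if k != j]
            nbrs = [k for k in rest if G[j, k]]
            w = np.zeros(p - 1)
            if nbrs:
                # Regress node j on its neighbours only; non-neighbours are
                # given zero weight, which enforces the structural zeros.
                beta = np.linalg.solve(Sigma[np.ix_(nbrs, nbrs)], S[nbrs, j])
                w = Sigma[np.ix_(rest, nbrs)] @ beta
            Sigma[rest, j] = w
            Sigma[j, rest] = w
    return Sigma, np.linalg.inv(Sigma)


G = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
S = np.array([[10.0, 1, 5, 4],
              [1, 10.0, 2, 6],
              [5, 2, 10.0, 3],
              [4, 6, 3, 10.0]])

Sigma, Omega = ggm_mle(S, G)
```

After convergence, the entries of `Sigma` on the edges agree with S, while the (1,3) and (2,4) entries are filled in so that the corresponding entries of `Omega` vanish.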


Graphical lasso 


We now discuss one way to learn a sparse GRF structure, which exploits the fact that there is a
1:1 correspondence between zeros in the precision matrix and absent edges in the graph. This
suggests that we can learn a sparse graph structure by using an objective that encourages zeros
in the precision matrix. By analogy to lasso (see Section 13.3), one can define the following $\ell_1$
penalized NLL:

$$J(\Omega) = -\log\det\Omega + \mathrm{tr}(S\Omega) + \lambda \|\Omega\|_1 \tag{26.70}$$

where $\|\Omega\|_1 = \sum_{j,k} |\omega_{jk}|$ is the 1-norm of the matrix. This is called the graphical lasso or
Glasso.

Figure 26.18 Sparse GGMs learned using graphical lasso applied to the flow cytometry data. (a) λ = 36.
(b) λ = 27. (c) λ = 7. (d) λ = 0. Figure generated by ggmLassoDemo.

Although the objective is convex, it is non-smooth (because of the non-differentiable $\ell_1$
penalty) and is constrained (because $\Omega$ must be a positive definite matrix). Several algorithms
have been proposed for optimizing this objective (Yuan and Lin 2007; Banerjee et al. 2008; Duchi 
et al. 2008), although arguably the simplest is the one in (Friedman et al. 2008), which uses a 
coordinate descent algorithm similar to the shooting algorithm for lasso. See ggmLassoHtf for 
an implementation. (See also (Mazumder and Hastie 2012) for a more recent version of this 
algorithm.) 
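To see why this objective produces exact zeros, it helps to write down its stationarity condition. The following is a sketch obtained by combining the gradient of the log likelihood given earlier with the subgradient of the absolute value; the symbol $\Gamma$ is introduced here for illustration.

```latex
% Subgradient condition for Equation 26.70, with \Sigma = \Omega^{-1}:
\begin{align*}
-\Omega^{-1} + S + \lambda \Gamma &= 0,
\qquad
\Gamma_{jk} \in
\begin{cases}
\{\mathrm{sign}(\omega_{jk})\} & \text{if } \omega_{jk} \neq 0 \\
[-1, 1] & \text{if } \omega_{jk} = 0
\end{cases} \\[4pt]
\omega_{jk} \neq 0 &\;\Rightarrow\; \Sigma_{jk} = S_{jk} + \lambda\, \mathrm{sign}(\omega_{jk}) \\
\omega_{jk} = 0 &\;\Rightarrow\; |\Sigma_{jk} - S_{jk}| \leq \lambda
\end{align*}
```

So an edge can be absent whenever the fitted covariance is within $\lambda$ of the empirical one, and larger values of $\lambda$ allow more edges to be dropped. With $\lambda = 0$ the condition reduces to $\Sigma = S$, the unconstrained MLE.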

As an example, let us apply the method to the flow cytometry dataset from (Sachs et al. 2005). 
A discretized version of the data is shown in Figure 26.17(a). Here we use the original continuous 
data. However, we are ignoring the fact that the data was sampled under intervention. In 
Figure 26.18, we illustrate the graph structures that are learned as we sweep λ from 0 to a large
value. These represent a range of plausible hypotheses about the connectivity of these proteins. 

It is worth comparing this with the DAG that was learned in Figure 26.17(b). The DAG has the 
advantage that it can easily model the interventional nature of the data, but the disadvantage 
that it cannot model the feedback loops that are known to exist in this biological pathway (see 
the discussion in (Schmidt and Murphy 2009)). Note that the fact that we show many UGMs and 
only one DAG is incidental: we could easily use BIC to pick the “best” UGM, and conversely, we 
could easily display several DAG structures, sampled from the posterior.


Bayesian inference for GGM structure * 


Although the graphical lasso is reasonably fast, it only gives a point estimate of the structure. 
Furthermore, it is not model-selection consistent (Meinshausen 2005), meaning it cannot recover 
the true graph even as $N \to \infty$. It would be preferable to integrate out the parameters, and
perform posterior inference in the space of graphs, i.e., to compute p(G|D). We can then extract
summaries of the posterior, such as posterior edge marginals, $p(G_{ij} = 1|D)$, just as we did for
DAGs. In this section, we discuss how to do this. 

Note that the situation is analogous to Chapter 13, where we discussed variable selection. In 
Section 13.2, we discussed Bayesian variable selection, where we integrated out the regression 
weights and computed $p(\gamma|D)$ and the marginal inclusion probabilities $p(\gamma_j = 1|D)$. Then
in Section 13.3, we discussed methods based on $\ell_1$ regularization. Here we have the same
dichotomy, but we are presenting them in the opposite order. 

If the graph is decomposable, and if we use conjugate priors, we can compute the marginal 
likelihood in closed form (Dawid and Lauritzen 1993). Furthermore, we can efficiently identify 
the decomposable neighbors of a graph (Thomas and Green 2009), i.e., the set of legal edge 
additions and removals. This means that we can perform relatively efficient stochastic local 
search to approximate the posterior (see e.g. (Giudici and Green 1999; Armstrong et al. 2008; 
Scott and Carvalho 2008)). 

However, the restriction to decomposable graphs is rather limiting if one’s goal is knowledge 
discovery, since the number of decomposable graphs is much less than the number of general 
undirected graphs.⁵

A few authors have looked at Bayesian inference for GGM structure in the non-decomposable 
case (e.g., (Dellaportas et al. 2003; Wong et al. 2003; Jones et al. 2005)), but such methods cannot 
scale to large models because they use an expensive Monte Carlo approximation to the marginal 
likelihood (Atay-Kayis and Massam 2005). (Lenkoski and Dobra 2008) suggested using a Laplace 
approximation. This requires computing the MAP estimate of the parameters for $\Omega$ under a G-Wishart
prior (Roverato 2002). In (Lenkoski and Dobra 2008), they used the iterative proportional
scaling algorithm (Speed and Kiiveri 1986; Hara and Takimura 2008) to find the mode. However, 
this is very slow, since it requires knowing the maximal cliques of the graph, which is NP-hard 
in general. 

In (Moghaddam et al. 2009), a much faster method is proposed. In particular, they modify 
the gradient-based methods from Section 26.7.1 to find the MAP estimate; these algorithms do 
not need to know the cliques of the graph. A further speedup is obtained by just using a 
diagonal Laplace approximation, which is more accurate than BIC, but has essentially the same 
cost. This, plus the lack of restriction to decomposable graphs, enables fairly fast stochastic 
search methods to be used to approximate p(G|D) and its mode. This approach significantly 
outperformed graphical lasso, both in terms of predictive accuracy and structural recovery, for a
comparable computational cost. 


5. The number of decomposable graphs on V nodes, for V = 2,..., 8, is as follows ((Armstrong 2005, p158)): 2; 8; 61;
822; 18,154; 617,675; 30,888,596. If we divide these numbers by the number of undirected graphs, which is $2^{V(V-1)/2}$,
we find the ratios are: 1, 1, 0.95, 0.8, 0.55, 0.29, 0.12. So we see that decomposable graphs form a vanishing fraction of
the total hypothesis space.
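The ratios quoted in the footnote can be checked with a short calculation. The counts are those given in the footnote (reading the V = 7 entry as 617,675); the variable names are chosen for this illustration.

```python
# Number of decomposable graphs on V = 2..8 labelled nodes, from footnote 5.
n_decomposable = {2: 2, 3: 8, 4: 61, 5: 822, 6: 18154, 7: 617675, 8: 30888596}

# There are 2^(V(V-1)/2) undirected graphs on V labelled nodes, one for each
# subset of the V(V-1)/2 possible edges.
ratios = {V: n / 2 ** (V * (V - 1) // 2) for V, n in n_decomposable.items()}
rounded = [round(r, 2) for r in ratios.values()]
```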
Handling non-Gaussian data using copulas *


The graphical lasso and its variants are inherently limited to data that is jointly Gaussian, which is
a rather severe restriction. Fortunately the method can be generalized to handle non-Gaussian,
but still continuous, data in a fairly simple fashion. The basic idea is to estimate a set of D
univariate monotonic transformations $f_j$, one per variable j, such that the resulting transformed
data is jointly Gaussian. If this is possible, we say the data belongs to the nonparametric
Normal distribution, or nonparanormal distribution (Liu et al. 2009). This is equivalent to the
family of Gaussian copulas (Klaassen and Wellner 1997). Details on how to estimate the $f_j$
transformations from the empirical cdf’s of each variable can be found in (Liu et al. 2009). After
transforming the data, we can compute the correlation matrix and then apply glasso in the usual
way. One can show, under various assumptions, that this is a consistent estimator of the graph
structure, representing the CI assumptions of the original distribution (Liu et al. 2009).
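The flavor of the transformation step can be conveyed with a simple rank-based version. This is a stand-in chosen for brevity, not the estimator of (Liu et al. 2009), whose details are in that paper: each value is replaced by the standard normal quantile of its scaled rank. The function name `normal_scores` and the sample data are assumptions.

```python
from statistics import NormalDist


def normal_scores(column):
    """Map one variable to approximately N(0, 1) via its ranks (ties not handled)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    z = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) is an empirical cdf value kept strictly inside (0, 1),
        # so the inverse normal cdf is always finite.
        z[i] = NormalDist().inv_cdf(rank / (n + 1))
    return z


skewed = [0.1, 0.5, 2.0, 9.0, 40.0]  # made-up, heavily skewed sample
z = normal_scores(skewed)
```

Applying such a map to every column gives monotonically transformed data whose marginals are approximately Gaussian; the correlation matrix of the transformed columns would then be passed to glasso.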


Learning undirected discrete graphical models 


The problem of learning the structure for UGMs with discrete variables is harder than the
Gaussian case, because computing the partition function $Z(\theta)$, which is needed for parameter
estimation, has complexity comparable to computing the permanent of a matrix, which in
general is intractable (Jerrum et al. 2004). By contrast, in the Gaussian case, computing Z only
requires computing a matrix determinant, which is at most $O(V^3)$.

Since stochastic local search is not tractable for general discrete UGMs, below we mention 
some possible alternative approaches that have been tried. 


Graphical lasso for MRFs/CRFs 


It is possible to extend the graphical lasso idea to the discrete MRF and CRF case. However, now 
there is a set of parameters associated with each edge in the graph, so we have to use the graph 
analog of group lasso (see Section 13.5.1). For example, consider a pairwise CRF with ternary 
nodes, and node and edge potentials given by 


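$$
\psi_t(y_t, \mathbf{x}) = \begin{pmatrix} \mathbf{v}_{t1}^T \mathbf{x} \\ \mathbf{v}_{t2}^T \mathbf{x} \\ \mathbf{v}_{t3}^T \mathbf{x} \end{pmatrix}, \quad
\psi_{st}(y_s, y_t, \mathbf{x}) = \begin{pmatrix}
\mathbf{w}_{st11}^T \mathbf{x} & \mathbf{w}_{st12}^T \mathbf{x} & \mathbf{w}_{st13}^T \mathbf{x} \\
\mathbf{w}_{st21}^T \mathbf{x} & \mathbf{w}_{st22}^T \mathbf{x} & \mathbf{w}_{st23}^T \mathbf{x} \\
\mathbf{w}_{st31}^T \mathbf{x} & \mathbf{w}_{st32}^T \mathbf{x} & \mathbf{w}_{st33}^T \mathbf{x}
\end{pmatrix} \tag{26.71}
$$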


where we assume $\mathbf{x}$ begins with a constant 1 term, to account for the offset. (If $\mathbf{x}$ only contains 
1, the CRF reduces to an MRF.) Note that we may choose to set some of the $\mathbf{v}_{tj}$ and $\mathbf{w}_{stjk}$ 
weights to 0, to ensure identifiability, although this can also be taken care of by the prior, as 
shown in Exercise 8.5. 

To learn sparse structure, we can minimize the following objective: 


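$$
J = -\sum_{i=1}^N \left[ \sum_{t} \log \psi_t(y_{it}, \mathbf{x}_i, \mathbf{v}_t) + \sum_{s=1}^V \sum_{t=s+1}^V \log \psi_{st}(y_{is}, y_{it}, \mathbf{x}_i, \mathbf{w}_{st}) \right]
+ \lambda_1 \sum_{s=1}^V \sum_{t=s+1}^V \|\mathbf{w}_{st}\|_p + \lambda_2 \sum_{t=1}^V \|\mathbf{v}_t\|_2^2 \tag{26.72}
$$
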
Figure 26.19 An MRF estimated from the 20-newsgroup data using group $\ell_1$ regularization with $\lambda = 256$. 
Isolated nodes are not plotted. From Figure 5.9 of (Schmidt 2010). Used with kind permission of Mark 
Schmidt. 


where $\|\mathbf{w}_{st}\|_p$ is the p-norm; common choices are $p = 2$ or $p = \infty$, as explained in 
Section 13.5.1. This method of CRF structure learning was first suggested in (Schmidt et al. 2008). 
(The use of $\ell_1$ regularization for learning the structure of binary MRFs was proposed in (Lee 
et al. 2006).) 
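
To make the regularizer in Equation 26.72 concrete, here is a minimal sketch that evaluates only the penalty part, $\lambda_1 \sum \|\mathbf{w}_{st}\|_p + \lambda_2 \sum \|\mathbf{v}_t\|_2^2$. The data layout (a dict from edges to flattened weight lists, and a list of node weight vectors) and the name `group_penalty` are assumptions made for illustration, not part of the original method description.

```python
def group_penalty(edge_weights, node_weights, lam1, lam2, p=2):
    """Penalty term of Equation 26.72.

    edge_weights: dict mapping an edge (s, t) to the flattened list w_st.
    node_weights: list of node weight vectors v_t.
    p: 2 or float("inf"), the norm applied to each edge group.
    """
    def p_norm(w):
        if p == float("inf"):
            return max(abs(x) for x in w)
        return sum(abs(x) ** p for x in w) ** (1.0 / p)

    edge_term = sum(p_norm(w) for w in edge_weights.values())
    node_term = sum(sum(x * x for x in v) for v in node_weights)
    return lam1 * edge_term + lam2 * node_term

edges = {(0, 1): [3.0, 4.0], (0, 2): [0.0, 0.0]}   # edge (0, 2) is switched off
nodes = [[1.0, 2.0]]
penalty_l2 = group_penalty(edges, nodes, lam1=2.0, lam2=0.5)                    # 12.5
penalty_linf = group_penalty(edges, nodes, lam1=2.0, lam2=0.5, p=float("inf"))  # 10.5
```

An edge contributes nothing to the penalty only when its whole weight group is zero, which is why minimizing this objective removes entire edges rather than individual weights.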

Although this objective is convex, it can be costly to evaluate, since we need to perform 
inference to compute its gradient, as explained in Section 19.6.3 (this is true also for MRFs). We 
should therefore use an optimizer that does not make too many calls to the objective function 
or its gradient, such as the projected quasi-Newton method in (Schmidt et al. 2009). In addition, 
we can use approximate inference, such as convex belief propagation (Section 22.4.2), to compute 
an approximate objective and gradient more quickly. Another approach is to apply the group 
lasso penalty to the pseudo-likelihood discussed in Section 19.5.4. This is much faster, since 
inference is no longer required (Hoefling and Tibshirani 2009). Figure 26.19 shows the result of 
applying this procedure to the 20-newsgroup data, where $y_{it}$ indicates the presence of word t 
in document i, and $\mathbf{x}_i = 1$ (so the model is an MRF). 
P(C=F) | P(C=T) 
0.5    | 0.5 

C | P(S=F) | P(S=T) 
F | 0.5    | 0.5 
T | 0.9    | 0.1 

C | P(R=F) | P(R=T) 
F | 0.8    | 0.2 
T | 0.2    | 0.8 

S R | P(W=F) | P(W=T) 
F T | 0.1    | 0.9 
T T | 0.01   | 0.99 


Figure 26.20 Water sprinkler DGM with corresponding binary CPTs. T and F stand for true and false. 


Thin junction trees 


So far, we have been concerned with learning “sparse” graphs, but these do not necessarily have 
low treewidth. For example, a D x D grid is sparse, but has treewidth O(D). This means that 
the models we learn may be intractable to use for inference purposes, which defeats one of the 
two main reasons to learn graph structure in the first place (the other reason being “knowledge 
discovery”). There have been various attempts to learn graphical models with bounded treewidth 
(e.g., (Bach and Jordan 2001; Srebro 2001; Elidan and Gould 2008; Shahaf et al. 2009)), also known 
as thin junction trees, but the exact problem in general is hard. 

An alternative approach is to learn a model with low circuit complexity (Gogate et al. 
2010; Poon and Domingos 2011). Such models may have high treewidth, but they exploit context-specific 
independence and determinism to enable fast exact inference (see e.g., (Darwiche 2009)). 


Exercises 


Exercise 26.1 Causal reasoning in the sprinkler network 


Consider the causal network in Figure 26.20. Let T represent true and F represent false. 


a. Suppose I perform a perfect intervention and make the grass wet. What is the probability the sprinkler 
is on, p(S = T|do(W = T))? 


b. Suppose I perform a perfect intervention and make the grass dry. What is the probability the sprinkler 
is on, p(S = T|do(W = F))? 


c. Suppose I perform a perfect intervention and make the clouds “turn on” (e.g., by seeding them). What 
is the probability the sprinkler is on, p(S = T|do(C = T))? 
Latent variable models for discrete data 


Introduction 


In this chapter, we are concerned with latent variable models for discrete data, such as bit vectors, 
sequences of categorical variables, count vectors, graph structures, relational data, etc. These 
models can be used to analyze voting records, text and document collections, low-intensity 
images, movie ratings, etc. However, we will mostly focus on text analysis, and this will be 
reflected in our terminology. 

Since we will be dealing with so many different kinds of data, we need some precise notation 
to keep things clear. When modeling variable-length sequences of categorical variables (i.e., 
symbols or tokens), such as words in a document, we will let $y_{il} \in \{1,\ldots,V\}$ represent 
the identity of the l'th word in document i, where V is the number of possible words in 
the vocabulary. We assume $l = 1:L_i$, where $L_i$ is the (known) length of document i, and 
$i = 1:N$, where N is the number of documents. 

We will often ignore the word order, resulting in a bag of words. This can be reduced to 
a fixed length vector of counts (a histogram). We will use $n_{iv} \in \{0,1,\ldots,L_i\}$ to denote the 
number of times word v occurs in document i, for $v = 1:V$. Note that the $N \times V$ count 
matrix $\mathbf{N}$ is often large but sparse, since we typically have many documents, but most words 
do not occur in any given document. 
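
As a small illustration of this reduction, the sketch below turns one token sequence into its count vector $\mathbf{n}_i$. The helper name `bag_of_words` is hypothetical, and word ids are taken to run from 1 to V as in the notation above.

```python
from collections import Counter

def bag_of_words(tokens, V):
    """Return the length-V count vector for one document of word ids in 1..V."""
    counts = Counter(tokens)
    return [counts.get(v, 0) for v in range(1, V + 1)]

doc = [3, 1, 3, 3, 5]           # y_{i,1:L_i} with L_i = 5
n_i = bag_of_words(doc, V=6)    # [1, 0, 3, 0, 1, 0]
```

The counts sum to the document length $L_i$, and most entries are zero, which is the sparsity mentioned above.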

In some cases, we might have multiple different bags of words, e.g., bags of text words and 
bags of visual words. These correspond to different “channels” or types of features. We will 
denote these by $y_{irl}$ for $r = 1:R$ (the number of responses) and $l = 1:L_{ir}$. If $L_{ir} = 1$, it 
means we have a single token (a bag of length 1); in this case, we just write $y_{ir} \in \{1,\ldots,V_r\}$ 
for brevity. If every channel is just a single token, we write the fixed-size response vector as 
$\mathbf{y}_{i,1:R}$; in this case, the $N \times R$ design matrix $\mathbf{Y}$ will not be sparse. For example, in social 
science surveys, $y_{ir}$ could be the response of person i to the r'th multi-choice question. 

Our goal is to build joint probability models of $p(\mathbf{y}_i)$ or $p(\mathbf{n}_i)$ using latent variables to capture 
the correlations. We will then try to interpret the latent variables, which provide a compressed 
representation of the data. We provide an overview of some approaches in Section 27.2, before 
going into more detail in later sections. 

Towards the end of the chapter, we will consider modeling graphs and relations, which can 
also be represented as sparse discrete matrices. For example, we might want to model the graph 
of which papers cite which other papers. We will denote these relations by $\mathbf{R}$, reserving the 
symbol $\mathbf{Y}$ for any categorical data (e.g., text) associated with the nodes. 


Distributed state LVMs for discrete data 


In this section, we summarize a variety of possible approaches for constructing models of the 
form $p(\mathbf{y}_{i,1:L_i})$, for bags of tokens; $p(\mathbf{y}_{1:R})$, for vectors of tokens; and $p(\mathbf{n}_i)$, for vectors of 
integer counts. 


Mixture models 


The simplest approach is to use a finite mixture model (Chapter 11). This associates a single 
discrete latent variable, $q_i \in \{1,\ldots,K\}$, with every document, where K is the number of 
clusters. We will use a discrete prior, $q_i \sim \text{Cat}(\boldsymbol{\pi})$. For variable length documents, we can 
define $p(y_{il} = v|q_i = k) = b_{kv}$, where $b_{kv}$ is the probability that cluster k generates word v. The 
value of $q_i$ is known as a topic, and the vector $\mathbf{b}_k$ is the k'th topic's word distribution. That is, 
the likelihood has the form 

$$
p(\mathbf{y}_{i,1:L_i}|q_i = k) = \prod_{l=1}^{L_i} \text{Cat}(y_{il}|\mathbf{b}_k) \tag{27.1}
$$

The induced distribution on the visible data is given by 

$$
p(\mathbf{y}_{i,1:L_i}) = \sum_k \pi_k \left[ \prod_{l=1}^{L_i} \text{Cat}(y_{il}|\mathbf{b}_k) \right] \tag{27.2}
$$

The “generative story” which this encodes is as follows: for document i, pick a topic $q_i$ from 
$\boldsymbol{\pi}$, call it k, and then for each word $l = 1:L_i$, pick a word from $\mathbf{b}_k$. We will consider more 
sophisticated generative models later in this chapter. 
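
Equation 27.2 can be evaluated directly for a small example. The following sketch computes the log marginal likelihood of one document in log space for numerical stability; it assumes 0-based word ids and strictly positive entries in the topic matrix, and the name `doc_log_likelihood` is made up for this illustration.

```python
import math

def doc_log_likelihood(doc, pi, B):
    """log p(y_{i,1:L_i}) under Equation 27.2.

    doc: list of word ids in 0..V-1 (0-based here, unlike the text).
    pi:  list of K mixing weights.
    B:   B[k][v] = b_kv, assumed strictly positive.
    """
    log_terms = [
        math.log(pi_k) + sum(math.log(B[k][v]) for v in doc)
        for k, pi_k in enumerate(pi)
    ]
    m = max(log_terms)  # log-sum-exp trick to avoid underflow on long documents
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

pi = [0.5, 0.5]
B = [[0.5, 0.5], [0.9, 0.1]]
ll = doc_log_likelihood([0, 0], pi, B)   # log(0.5 * 0.25 + 0.5 * 0.81) = log(0.53)
```

Note that the whole document shares a single topic, so the sum over k sits outside the product over words.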

If we have a fixed set of categorical observations, we can use a different topic matrix for each 
output variable: 

$$
p(\mathbf{y}_{i,1:R}|q_i = k) = \prod_{r=1}^{R} \text{Cat}(y_{ir}|\mathbf{b}_k^{(r)}) \tag{27.3}
$$

This is an unsupervised analog of naive Bayes classification. 

We can also model count vectors. If the sum $L_i = \sum_v n_{iv}$ is known, we can use a 
multinomial: 

$$
p(\mathbf{n}_i|L_i, q_i = k) = \text{Mu}(\mathbf{n}_i|L_i, \mathbf{b}_k) \tag{27.4}
$$

If the sum is unknown, we can use a Poisson class-conditional density to give 

$$
p(\mathbf{n}_i|q_i = k) = \prod_{v=1}^{V} \text{Poi}(n_{iv}|\lambda_{vk}) \tag{27.5}
$$

In this case, $L_i|q_i = k \sim \text{Poi}(\sum_v \lambda_{vk})$. 


Exponential family PCA 


Unfortunately, finite mixture models are very limited in their expressive power. A more flexible 
model is to use a vector of real-valued continuous latent variables, similar to the factor analysis 
(FA) and PCA models in Chapter 12. In PCA, we use a Gaussian prior of the form $p(\mathbf{z}_i) = 
\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\mathbf{z}_i \in \mathbb{R}^K$, and a Gaussian likelihood of the form $p(\mathbf{y}_i|\mathbf{z}_i) = \mathcal{N}(\mathbf{W}\mathbf{z}_i, \sigma^2\mathbf{I})$. 
This method can certainly be applied to discrete or count data. Indeed, the method known 
as latent semantic analysis (LSA) or latent semantic indexing (LSI) (Deerwester et al. 1990; 
Dumais and Landauer 1997) is exactly equivalent to applying PCA to a term by document count 
matrix. 

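A better method for modeling categorical data is to use a multinoulli or multinomial distribution. 
We just have to change the likelihood to 

$$
p(\mathbf{y}_{i,1:L_i}|\mathbf{z}_i) = \prod_{l=1}^{L_i} \text{Cat}(y_{il}|\mathcal{S}(\mathbf{W}\mathbf{z}_i)) \tag{27.6}
$$

where $\mathbf{W} \in \mathbb{R}^{V \times K}$ is a weight matrix and $\mathcal{S}$ is the softmax function. If we have a fixed number 
of categorical responses, we can use 

$$
p(\mathbf{y}_{1:R}|\mathbf{z}_i) = \prod_{r=1}^{R} \text{Cat}(y_{ir}|\mathcal{S}(\mathbf{W}_r\mathbf{z}_i)) \tag{27.7}
$$

where $\mathbf{W}_r \in \mathbb{R}^{V \times K}$ is the weight matrix for the r'th response variable. This model is called 
categorical PCA, and is illustrated in Figure 27.1(a); see Section 12.4 for further discussion. If we 
have counts, we can use a multinomial model 

$$
p(\mathbf{n}_i|L_i, \mathbf{z}_i) = \text{Mu}(\mathbf{n}_i|L_i, \mathcal{S}(\mathbf{W}\mathbf{z}_i)) \tag{27.8}
$$

or a Poisson model 

$$
p(\mathbf{n}_i|\mathbf{z}_i) = \prod_{v=1}^{V} \text{Poi}(n_{iv}|\exp(\mathbf{w}_{v,:}^T\mathbf{z}_i)) \tag{27.9}
$$

All of these models are examples of exponential family PCA or ePCA (Collins et al. 2002; 
Mohamed et al. 2008), which is an unsupervised analog of GLMs. The corresponding induced 
distribution on the visible variables has the form 

$$
p(\mathbf{y}_{i,1:L_i}) = \int \left[ \prod_{l=1}^{L_i} p(y_{il}|\mathbf{z}_i, \mathbf{W}) \right] \mathcal{N}(\mathbf{z}_i|\boldsymbol{\mu}, \boldsymbol{\Sigma}) \, d\mathbf{z}_i \tag{27.10}
$$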


Fitting this model is tricky, due to the lack of conjugacy. (Collins et al. 2002) proposed a 
coordinate ascent method that alternates between estimating the $\mathbf{z}_i$ and $\mathbf{W}$. This can be 
regarded as a degenerate version of EM, that computes a point estimate of $\mathbf{z}_i$ in the E step. The 
problem with the degenerate approach is that it is very prone to overfitting, since the number 
of latent variables is proportional to the number of datacases (Welling et al. 2008). A true EM 
algorithm would marginalize out the latent variables $\mathbf{z}_i$. A way to do this for categorical PCA, 
using variational EM, is discussed in Section 12.4. For more general models, one can use MCMC 
(Mohamed et al. 2008). 


Figure 27.1 Two LVMs for discrete data. Circles are scalar nodes, ellipses are vector nodes, squares are 
matrix nodes. (a) Categorical PCA. (b) Multinomial PCA. 


LDA and mPCA 


In ePCA, the quantity $\mathbf{W}\mathbf{z}_i$ represents the natural parameters of the exponential family. 
Sometimes it is more convenient to use the dual parameters. For example, for the multinomial, the 
dual parameter is the probability vector, whereas the natural parameter is the vector of log odds. 

If we want to use the dual parameters, we need to constrain the latent variables so they live 
in the appropriate parameter space. In the case of categorical data, we will need to ensure the 
latent vector lives in $S_K$, the K-dimensional probability simplex. To avoid confusion with ePCA, 
we will denote such a latent vector by $\boldsymbol{\pi}_i$. In this case, the natural prior for the latent variables 
is the Dirichlet, $\boldsymbol{\pi}_i \sim \text{Dir}(\boldsymbol{\alpha})$. Typically we set $\boldsymbol{\alpha} = \alpha\mathbf{1}_K$. If we set $\alpha < 1$, we encourage $\boldsymbol{\pi}_i$ 
to be sparse, as shown in Figure 2.14. 
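
The effect of the concentration parameter is easy to check by simulation. The sketch below draws from a symmetric Dirichlet by normalizing independent Gamma variates (a standard construction) and compares the average size of the largest component for a small and a large value of $\alpha$; the helper names, sample sizes and seed are arbitrary choices for illustration.

```python
import random

def sample_symmetric_dirichlet(alpha, K, rng):
    """Draw pi ~ Dir(alpha * 1_K) by normalizing K independent Gamma(alpha, 1) draws."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    total = sum(g)
    return [x / total for x in g]

def mean_largest_weight(alpha, K=10, n_samples=2000, seed=0):
    """Average of max_k pi_k over many draws; large values indicate sparse vectors."""
    rng = random.Random(seed)
    draws = (sample_symmetric_dirichlet(alpha, K, rng) for _ in range(n_samples))
    return sum(max(pi) for pi in draws) / n_samples

sparse = mean_largest_weight(0.1)    # most of the mass sits on one or two components
dense = mean_largest_weight(10.0)    # mass is spread almost evenly, near 1/K each
```

With a small $\alpha$ a single topic tends to dominate each draw, whereas a large $\alpha$ gives nearly uniform mixtures.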

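When we have a count vector whose total sum is known, the likelihood is given by 

$$
p(\mathbf{n}_i|L_i, \boldsymbol{\pi}_i) = \text{Mu}(\mathbf{n}_i|L_i, \mathbf{B}\boldsymbol{\pi}_i) \tag{27.11}
$$

This model is called multinomial PCA or mPCA (Buntine 2002; Buntine and Jakulin 2004, 
2006). See Figure 27.1(b). Since we are assuming $\mathbb{E}[n_{iv}] = L_i \sum_k b_{vk}\pi_{ik}$, this can be seen as a form 
of matrix factorization for the count matrix. Note that we use $b_{vk}$ to denote the parameter 
vector, rather than $w_{vk}$, since we impose the constraints that $0 \le b_{vk} \le 1$ and $\sum_v b_{vk} = 1$. 
The corresponding marginal distribution has the form 

$$
p(\mathbf{n}_i|L_i) = \int \text{Mu}(\mathbf{n}_i|L_i, \mathbf{B}\boldsymbol{\pi}_i) \, \text{Dir}(\boldsymbol{\pi}_i|\boldsymbol{\alpha}) \, d\boldsymbol{\pi}_i \tag{27.12}
$$

Unfortunately, this integral cannot be computed analytically. 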


If we have a variable length sequence (of known length), we can use 

$$
p(\mathbf{y}_{i,1:L_i}|\boldsymbol{\pi}_i) = \prod_{l=1}^{L_i} \text{Cat}(y_{il}|\mathbf{B}\boldsymbol{\pi}_i) \tag{27.13}
$$

This is called latent Dirichlet allocation or LDA (Blei et al. 2003), and will be described in 
much greater detail below. LDA can be thought of as a probabilistic extension of LSA, where the 
latent quantities $\pi_{ik}$ are non-negative and sum to one. By contrast, in LSA, $z_{ik}$ can be negative, 
which makes interpretation difficult. 

A predecessor to LDA, known as probabilistic latent semantic indexing or PLSI (Hofmann 
1999), uses the same model but computes a point estimate of $\boldsymbol{\pi}_i$ for each document (similar to 
ePCA), rather than integrating it out. Thus in PLSI, there is no prior for $\boldsymbol{\pi}_i$. 

We can modify LDA to handle a fixed number of different categorical responses as follows: 

$$
p(\mathbf{y}_{i,1:R}|\boldsymbol{\pi}_i) = \prod_{r=1}^{R} \text{Cat}(y_{ir}|\mathbf{B}^{(r)}\boldsymbol{\pi}_i) \tag{27.14}
$$


This has been called the user rating profile (URP) model (Marlin 2003), and the simplex factor 
model (Bhattacharya and Dunson 2011). 


GaP model and non-negative matrix factorization 


Now consider modeling count vectors where we do not constrain the sum to be observed. In 
this case, the latent variables just need to be non-negative, so we will denote them by $\mathbf{z}_i^+$. This 
can be ensured by using a prior of the form 

$$
p(\mathbf{z}_i^+) = \prod_{k=1}^{K} \text{Ga}(z_{ik}^+|\alpha_k, \beta_k) \tag{27.15}
$$

The likelihood is given by 

$$
p(\mathbf{n}_i|\mathbf{z}_i^+) = \prod_{v=1}^{V} \text{Poi}(n_{iv}|\mathbf{b}_{v,:}^T\mathbf{z}_i^+) \tag{27.16}
$$

This is called the GaP (Gamma-Poisson) model (Canny 2004). See Figure 27.2(a). 

In (Buntine and Jakulin 2006), it is shown that the GaP model, when conditioned on a fixed 
$L_i$, reduces to the mPCA model. This follows since a set of Poisson random variables, when 
conditioned on their sum, becomes a multinomial distribution (see e.g., (Ross 1989)). 
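
The Poisson-to-multinomial step can be spelled out in a few lines. Below, the rates are written generically as $\lambda_v$ with total $\Lambda = \sum_v \lambda_v$ (symbols introduced here for the derivation); the last line additionally assumes that the columns of $\mathbf{B}$ sum to one, as in mPCA.

```latex
\begin{align*}
p(\mathbf{n}_i \mid L_i)
  &= \frac{\prod_{v=1}^{V} e^{-\lambda_v}\lambda_v^{n_{iv}} / n_{iv}!}
          {e^{-\Lambda}\Lambda^{L_i} / L_i!}
   \qquad \text{since } L_i = \textstyle\sum_v n_{iv} \sim \text{Poi}(\Lambda) \\
  &= \frac{L_i!}{\prod_v n_{iv}!} \prod_{v=1}^{V}
     \left(\frac{\lambda_v}{\Lambda}\right)^{n_{iv}}
   = \text{Mu}(\mathbf{n}_i \mid L_i, \boldsymbol{\lambda}/\Lambda) \\
\frac{\lambda_v}{\Lambda}
  &= \frac{\mathbf{b}_{v,:}^T\mathbf{z}_i^+}{\sum_{v'} \mathbf{b}_{v',:}^T\mathbf{z}_i^+}
   = \sum_k b_{vk}\,\frac{z_{ik}^+}{\sum_{k'} z_{ik'}^+}
   = [\mathbf{B}\boldsymbol{\pi}_i]_v,
   \qquad \pi_{ik} = \frac{z_{ik}^+}{\sum_{k'} z_{ik'}^+}
\end{align*}
```

So conditioning on $L_i$ turns the GaP likelihood of Equation 27.16 into the multinomial likelihood of Equation 27.11, with the normalized latent vector playing the role of $\boldsymbol{\pi}_i$.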

If we set $\alpha_k = \beta_k = 0$ in the GaP model, we recover a method known as non-negative 
matrix factorization or NMF (Lee and Seung 2001), as shown in (Buntine and Jakulin 2006). 
NMF is not a probabilistic generative model, since it does not specify a proper prior for $\mathbf{z}_i^+$. 
Furthermore, the algorithm proposed in (Lee and Seung 2001) is another degenerate EM algorithm, 
so suffers from overfitting. Some procedures to fit the GaP model, which overcome these 
problems, are given in (Buntine and Jakulin 2006). 

To encourage $\mathbf{z}_i^+$ to be sparse, we can modify the prior to be a spike-and-Gamma type prior 
as follows: 

$$
p(z_{ik}^+) = \rho_k \mathbb{I}(z_{ik}^+ = 0) + (1 - \rho_k)\text{Ga}(z_{ik}^+|\alpha_k, \beta_k) \tag{27.17}
$$

where $\rho_k$ is the probability of the spike at 0. This is called the conditional Gamma Poisson 
model (Buntine and Jakulin 2006). It is simple to modify Gibbs sampling to handle this kind of 
prior, although we will not go into detail here. 


Figure 27.2 (a) Gamma-Poisson (GaP) model. (b) Latent Dirichlet allocation (LDA) model. 


Latent Dirichlet allocation (LDA) 


In this section, we explain the latent Dirichlet allocation or LDA (Blei et al. 2003) model in 
detail. 


Basics 


In a mixture of multinoullis, every document is assigned to a single topic, $q_i \in \{1,\ldots,K\}$, 
drawn from a global distribution $\boldsymbol{\pi}$. In LDA, every word is assigned to its own topic, $q_{il} \in 
\{1,\ldots,K\}$, drawn from a document-specific distribution $\boldsymbol{\pi}_i$. Since a document belongs to a 
distribution over topics, rather than a single topic, the model is called an admixture mixture 
or mixed membership model (Erosheva et al. 2004). This model has many other applications 
beyond text analysis, e.g., genetics (Pritchard et al. 2000), health science (Erosheva et al. 2007), 
social network analysis (Airoldi et al. 2008), etc. 

Adding conjugate priors to the parameters, the full model is as follows:¹ 

\begin{align}
\boldsymbol{\pi}_i|\alpha &\sim \text{Dir}(\alpha\mathbf{1}_K) \tag{27.18} \\
q_{il}|\boldsymbol{\pi}_i &\sim \text{Cat}(\boldsymbol{\pi}_i) \tag{27.19} \\
\mathbf{b}_k|\gamma &\sim \text{Dir}(\gamma\mathbf{1}_V) \tag{27.20} \\
y_{il}|q_{il} = k, \mathbf{B} &\sim \text{Cat}(\mathbf{b}_k) \tag{27.21}
\end{align}


This is illustrated in Figure 27.2(b). We can marginalize out the $q_{il}$ variables, thereby creating a 
direct arc from $\boldsymbol{\pi}_i$ to $y_{il}$, with the following CPD: 

$$
p(y_{il} = v|\boldsymbol{\pi}_i) = \sum_k p(y_{il} = v|q_{il} = k)\,p(q_{il} = k) = \sum_k \pi_{ik} b_{kv} \tag{27.22}
$$


1. Our notation is similar to the one we use elsewhere in this book, but is different from that used by most LDA papers. 
They typically use $w_{nd}$ for the identity of word n in document d, $z_{nd}$ to represent the discrete indicator, $\boldsymbol{\theta}_d$ as the 
continuous latent vector for document d, and $\boldsymbol{\beta}_k$ as the k'th topic vector. 


Figure 27.3 Geometric interpretation of LDA. We have K = 2 topics and V = 3 words. Each document 
(white dots), and each topic (black dots), is a point in the 3d simplex. Source: Figure 5 of (Steyvers and 
Griffiths 2007). Used with kind permission of Tom Griffiths. 


As we mentioned in the introduction, this is very similar to the multinomial PCA model proposed 
in (Buntine 2002), which in turn is closely related to categorical PCA, GaP, NMF, etc. 
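
Equation 27.22 and the per-word generative process can be checked against each other numerically. In this sketch, `word_distribution` and `sample_document` are hypothetical helper names, word and topic ids are 0-based, and `B[k][v]` stores $b_{kv}$; the particular numbers are made up.

```python
import random

def word_distribution(pi_i, B):
    """Equation 27.22: p(y = v | pi_i) = sum_k pi_ik * b_kv."""
    V = len(B[0])
    return [sum(pi_k * B[k][v] for k, pi_k in enumerate(pi_i)) for v in range(V)]

def sample_document(pi_i, B, L, rng):
    """Generate L words: draw a topic for each word, then a word from that topic."""
    K, V = len(B), len(B[0])
    doc = []
    for _ in range(L):
        k = rng.choices(range(K), weights=pi_i)[0]    # q_il ~ Cat(pi_i)
        v = rng.choices(range(V), weights=B[k])[0]    # y_il ~ Cat(b_k)
        doc.append(v)
    return doc

pi_i = [0.25, 0.75]
B = [[0.5, 0.5, 0.0],     # topic 0 over V = 3 words
     [0.0, 0.5, 0.5]]     # topic 1
p = word_distribution(pi_i, B)                       # [0.125, 0.5, 0.375]
doc = sample_document(pi_i, B, L=20000, rng=random.Random(0))
freq = [doc.count(v) / len(doc) for v in range(3)]   # close to p
```

The empirical word frequencies of a long sampled document approach the marginal in Equation 27.22, which is the sense in which the $q_{il}$ can be integrated out.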

LDA has an interesting geometric interpretation. Each vector $\mathbf{b}_k$ defines a distribution over 
V words; each k is known as a topic. Each document vector $\boldsymbol{\pi}_i$ defines a distribution over K 
topics. So we model each document as an admixture over topics. Equivalently, we can think 
of LDA as a form of dimensionality reduction (assuming K < V, as is usually the case), where 
we project a point in the V-dimensional simplex (a normalized document count vector $\mathbf{x}_i$) onto 
the K-dimensional simplex. This is illustrated in Figure 27.3, where we have V = 3 words and 
K = 2 topics. The observed documents (which live in the 3d simplex) are approximated as 
living on a 2d simplex spanned by the 2 topic vectors, each of which lives in the 3d simplex. 

One advantage of using the simplex as our latent space rather than Euclidean space is that 
the simplex can handle ambiguity. This is important since in natural language, words can often 
have multiple meanings, a phenomenon known as polysemy. For example, “play” might refer 
to a verb (e.g., “to play ball” or “to play the cornet”), or to a noun (e.g., “Shakespeare's play”). 
In LDA, we can have multiple topics, each of which can generate the word “play”, as shown in 
Figure 27.4, reflecting this ambiguity. 

Given word l in document i, we can compute $p(q_{il} = k|\mathbf{y}_i, \boldsymbol{\theta})$, and thus infer its most likely 
topic. By looking at the word in isolation, it might be hard to know what sense of the word is 
meant, but we can disambiguate this by looking at other words in the document. In particular, 
given $\mathbf{x}_i$, we can infer the topic distribution $\boldsymbol{\pi}_i$ for the document; this acts as a prior for 
disambiguating $q_{il}$. This is illustrated in Figure 27.5, where we show three documents from the 
TASA corpus.² In the first document, there are a variety of music related words, which suggest 
$\boldsymbol{\pi}_i$ will put most of its mass on the music topic (number 77); this in turn makes the music 
interpretation of “play” the most likely, as shown by the superscript. The second document 
interprets play in the theatrical sense, and the third in the sports sense. Note that it is crucial 
that $\boldsymbol{\pi}_i$ be a latent variable, so information can flow between the $q_{il}$'s, thus enabling local 
disambiguation to use the full set of words. 


2. The TASA corpus is a collection of 37,000 high-school level English documents, comprising over 10 million words, 


Topic 77            Topic 82               Topic 166 
word      prob.     word          prob.    word       prob. 
MUSIC     .090      LITERATURE    .031     PLAY       .136 
DANCE     .034      POEM          .028     BALL       .129 
SONG      .033      POETRY        .027     GAME       .065 
PLAY      .030      POET          .020     PLAYING    .042 
SING      .026      PLAYS         .019     HIT        .032 
SINGING   .026      POEMS         .019     PLAYED     .031 
BAND      .026      PLAY          .015     BASEBALL   .027 
PLAYED    .023      LITERARY      .013     GAMES      .025 
SANG      .022      WRITERS       .013     BAT        .019 
SONGS     .021      DRAMA         .012     RUN        .019 
DANCING   .020      WROTE         .012     THROW      .016 
PIANO     .017      POETS         .011     BALLS      .015 
PLAYING   .016      WRITER        .011     TENNIS     .011 
RHYTHM    .015      SHAKESPEARE   .010     HOME       .010 
ALBERT    .013      WRITTEN       .009     CATCH      .010 
MUSICAL   .013      STAGE         .009     FIELD      .010 


Figure 27.4 Three topics related to the word play. Source: Figure 9 of (Steyvers and Griffiths 2007). 
Used with kind permission of Tom Griffiths. 


[Documents #29795, #1883 and #21359, with each word annotated by its inferred topic as a superscript.] 

Figure 27.5 Three documents from the TASA corpus containing different senses of the word play. Grayed 
out words were ignored by the model, because they correspond to uninteresting stop words (such as “and”, 
“the”, etc.) or very low frequency words. Source: Figure 10 of (Steyvers and Griffiths 2007). Used with 
kind permission of Tom Griffiths. 


Unsupervised discovery of topics 


One of the main purposes of LDA is to discover topics in a large collection or corpus of documents 
(see Figure 27.12 for an example). Unfortunately, since the model is unidentifiable, the 
interpretation of the topics can be difficult (Chang et al. 2009). One approach, known as labeled 
LDA (Ramage et al. 2009), exploits the existence of tags on documents as a way to ensure 
identifiability. In particular, it forces the topics to correspond to the tags, and then it learns a 
distribution over words for each tag. This can make the results easier to interpret. 


Quantitatively evaluating LDA as a language model 


In order to evaluate LDA quantitatively, we can treat it as a language model, i.e., a probability 
distribution over sequences of words. Of course, it is not a very good language model, since it 
ignores word order and just looks at single words (unigrams), but it is interesting to compare 
LDA to other unigram-based models, such as mixtures of multinoullis, and pLSI. Such simple 
language models are sometimes useful for information retrieval purposes. The standard way to 
measure the quality of a language model is to use perplexity, which we now define below. 


Perplexity 
The perplexity of language model q given a stochastic process³ p is defined as 

$$
\text{perplexity}(p, q) \triangleq 2^{H(p,q)} \tag{27.23}
$$

where $H(p,q)$ is the cross-entropy of the two stochastic processes, defined as 

$$
H(p,q) \triangleq \lim_{N\to\infty} -\frac{1}{N} \sum_{\mathbf{y}_{1:N}} p(\mathbf{y}_{1:N}) \log q(\mathbf{y}_{1:N}) \tag{27.24}
$$

The cross entropy (and hence perplexity) is minimized if q = p; in this case, the model can 
predict as well as the “true” distribution. 

We can approximate the stochastic process by using a single long test sequence (composed 
of multiple documents and multiple sentences, complete with end-of-sentence markers), call 
it $\mathbf{y}^*_{1:N}$. (This approximation becomes more and more accurate as the sequence gets longer, 
provided the process is stationary and ergodic (Cover and Thomas 2006).) Define the empirical 
distribution (an approximation to the stochastic process) as 

$$
p_{\text{emp}}(\mathbf{y}_{1:N}) = \delta_{\mathbf{y}^*_{1:N}}(\mathbf{y}_{1:N}) \tag{27.25}
$$

collated by a company formerly known as Touchstone Applied Science Associates, but now known as Questar Assessment 
Inc www.questarai.com. 

3. A stochastic process is one which can define a joint distribution over an arbitrary number of random variables. We 
can think of natural language as a stochastic process, since it can generate an infinite stream of words. 
In this case, the cross-entropy becomes 

$$
H(p_{\text{emp}}, q) = -\frac{1}{N} \log q(\mathbf{y}^*_{1:N}) \tag{27.26}
$$

and the perplexity becomes 

$$
\text{perplexity}(p_{\text{emp}}, q) = 2^{H(p_{\text{emp}}, q)} = q(\mathbf{y}^*_{1:N})^{-1/N} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{q(y^*_i|\mathbf{y}^*_{1:i-1})}} \tag{27.27}
$$

We see that this is the geometric mean of the inverse predictive probabilities, which is the usual 
definition of perplexity (Jurafsky and Martin 2008, p96). 

In the case of unigram models, the cross entropy term is given by 

$$
H = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{L_i} \sum_{l=1}^{L_i} \log q(y_{il}) \tag{27.28}
$$

where N is the number of documents and $L_i$ is the number of words in document i. Hence 
the perplexity of model q is given by 

$$
\text{perplexity}(p_{\text{emp}}, q) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{L_i} \sum_{l=1}^{L_i} \log q(y_{il}) \right) \tag{27.29}
$$

Intuitively, perplexity measures the weighted average branching factor of the model's predictive 
distribution. Suppose the model predicts that each symbol (letter, word, whatever) is equally 
likely, so $p(y_i|\mathbf{y}_{1:i-1}) = 1/K$. Then the perplexity is $((1/K)^N)^{-1/N} = K$. If some symbols 
are more likely than others, and the model correctly reflects this, its perplexity will be lower 
than K. Of course, $H(p,p) = H(p) \le H(p,q)$, so we can never reduce the perplexity below 
the entropy of the underlying stochastic process. 
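
The unigram perplexity of Equation 27.29 and the branching-factor intuition can be verified with a short sketch. The function name `unigram_perplexity` and the toy data are illustrative only; word ids are 0-based and every word in the test documents is assumed to have nonzero probability under q.

```python
import math

def unigram_perplexity(docs, q):
    """Equation 27.29: exp(-(1/N) sum_i (1/L_i) sum_l log q(y_il))."""
    total = 0.0
    for doc in docs:
        # per-document average log probability of its words
        total += sum(math.log(q[v]) for v in doc) / len(doc)
    return math.exp(-total / len(docs))

docs = [[0, 1, 2], [3, 3]]
uniform = [0.25] * 4
pp_uniform = unigram_perplexity(docs, uniform)                # equals K = 4

skewed_docs = [[0, 0, 0, 1]]
pp_matched = unigram_perplexity(skewed_docs, [0.75, 0.25])    # below K = 2
pp_flat = unigram_perplexity(skewed_docs, [0.5, 0.5])         # equals K = 2
```

A uniform model over K symbols scores exactly K regardless of the data, while a model that matches the skew of the data scores lower.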


Perplexity of LDA 


The key quantity is $p(v)$, the predictive distribution of the model over possible words. (It is 
implicitly conditioned on the training set.) For LDA, this can be approximated by plugging in 
$\mathbf{B}$ (e.g., the posterior mean estimate) and approximately integrating out $\mathbf{q}$ using mean field 
inference (see (Wallach et al. 2009) for a more accurate way to approximate the predictive 
likelihood). 

In Figure 27.6, we compare LDA to several other simple unigram models, namely MAP estima- 
tion of a multinoulli, MAP estimation of a mixture of multinoullis, and pLSI. (When performing 
MAP estimation, the same Dirichlet prior on B was used as in the LDA model.) The metric 
is perplexity, as in Equation 27.29, and the data is a subset of the TREC AP corpus containing 
16,333 newswire articles with 23,075 unique terms. We see that LDA significantly outperforms 
these other methods. 


Figure 27.6 Perplexity vs number of topics on the TREC AP corpus for various language models. Based 
on Figure 9 of (Blei et al. 2003). Figure generated by bleiLDAperplexityPlot. 


Figure 27.7 (a) LDA unrolled for N documents. (b) Collapsed LDA, where we integrate out the $\boldsymbol{\pi}_i$ and 
the $\mathbf{b}_k$. 


27.3.4 Fitting using (collapsed) Gibbs sampling 


It is straightforward to derive a Gibbs sampling algorithm for LDA. The full conditionals are as 
follows: 

\begin{align}
p(q_{il} = k|\cdot) &\propto \exp[\log \pi_{ik} + \log b_{k,y_{il}}] \tag{27.30} \\
p(\boldsymbol{\pi}_i|\cdot) &= \text{Dir}(\{\alpha_k + \sum_l \mathbb{I}(q_{il} = k)\}) \tag{27.31} \\
p(\mathbf{b}_k|\cdot) &= \text{Dir}(\{\gamma_v + \sum_i \sum_l \mathbb{I}(y_{il} = v, q_{il} = k)\}) \tag{27.32}
\end{align}

However, one can get better performance by analytically integrating out the $\boldsymbol{\pi}_i$'s and the $\mathbf{b}_k$'s, 
both of which have a Dirichlet distribution, and just sampling the discrete $q_{il}$'s. This approach 
was first suggested in (Griffiths and Steyvers 2004), and is an example of collapsed Gibbs 
sampling. Figure 27.7(b) shows that now all the $q_{il}$ variables are fully correlated. However, we 
can sample them one at a time, as we explain below. 

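First, we need some notation. Let $c_{ivk} = \sum_{l=1}^{L_i} \mathbb{I}(q_{il} = k, y_{il} = v)$ be the number of times 
word v is assigned to topic k in document i. Let $c_{ik} = \sum_v c_{ivk}$ be the number of times any 
word from document i has been assigned to topic k. Let $c_{vk} = \sum_i c_{ivk}$ be the number of times 
word v has been assigned to topic k in any document. Let $n_{iv} = \sum_k c_{ivk}$ be the number of 
times word v occurs in document i; this is observed. Let $c_k = \sum_v c_{vk}$ be the number of words 
assigned to topic k. Finally, let $L_i = \sum_k c_{ik}$ be the number of words in document i; this is 
observed. 

We can now derive the marginal prior. By applying Equation 5.24, one can show that 

\begin{align}
p(\mathbf{q}|\alpha) &= \prod_i \int \left[ \prod_{l=1}^{L_i} \text{Cat}(q_{il}|\boldsymbol{\pi}_i) \right] \text{Dir}(\boldsymbol{\pi}_i|\alpha\mathbf{1}_K) \, d\boldsymbol{\pi}_i \tag{27.33} \\
&= \left( \frac{\Gamma(K\alpha)}{\Gamma(\alpha)^K} \right)^N \prod_{i=1}^{N} \frac{\prod_{k=1}^{K} \Gamma(c_{ik} + \alpha)}{\Gamma(L_i + K\alpha)} \tag{27.34}
\end{align}

By similar reasoning, one can show 

\begin{align}
p(\mathbf{y}|\mathbf{q}, \gamma) &= \prod_k \int \left[ \prod_{il: q_{il} = k} \text{Cat}(y_{il}|\mathbf{b}_k) \right] \text{Dir}(\mathbf{b}_k|\gamma\mathbf{1}_V) \, d\mathbf{b}_k \tag{27.35} \\
&= \left( \frac{\Gamma(V\gamma)}{\Gamma(\gamma)^V} \right)^K \prod_{k=1}^{K} \frac{\prod_{v=1}^{V} \Gamma(c_{vk} + \gamma)}{\Gamma(c_k + V\gamma)} \tag{27.36}
\end{align}

From the above equations, and using the fact that $\Gamma(x + 1)/\Gamma(x) = x$, we can derive the full 
conditional for $p(q_{il}|\mathbf{q}_{-i,l})$. Define $c^-_{ivk}$ to be the same as $c_{ivk}$ except it is computed by summing 
over all locations in document i except for $q_{il}$. Also, let $y_{il} = v$. Then 

$$
p(q_{il} = k|\mathbf{q}_{-i,l}, \mathbf{y}, \alpha, \gamma) \propto \frac{c^-_{vk} + \gamma}{c^-_k + V\gamma} \cdot \frac{c^-_{ik} + \alpha}{L_i + K\alpha} \tag{27.37}
$$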

We see that a word in a document is assigned to a topic based both on how often that word is 
generated by the topic (first term), and also on how often that topic is used in that document 
(second term). 

Given Equation 27.37, we can implement the collapsed Gibbs sampler as follows. We randomly 
assign a topic to each word, $q_{il} \in \{1,\ldots,K\}$. We can then sample a new topic as follows: for 
a given word in the corpus, decrement the relevant counts, based on the topic assigned to the 
current word; draw a new topic from Equation 27.37; update the count matrices; and repeat. 
This algorithm can be made efficient since the count matrices are very sparse. 
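
The procedure just described can be sketched as follows. This is a minimal dense-array illustration rather than an efficient sparse implementation; the function name `collapsed_gibbs_lda`, the 0-based word and topic ids, the fixed number of sweeps and the seed are all assumptions made for the example.

```python
import random

def collapsed_gibbs_lda(docs, K, V, alpha, gamma, n_sweeps, seed=0):
    """Collapsed Gibbs sampling for LDA based on Equation 27.37.

    docs: list of documents, each a list of word ids in 0..V-1.
    Returns the topic assignments q and the count tables c_ik, c_vk, c_k.
    """
    rng = random.Random(seed)
    c_ik = [[0] * K for _ in docs]        # words in document i assigned to topic k
    c_vk = [[0] * K for _ in range(V)]    # times word v is assigned to topic k
    c_k = [0] * K                         # total words assigned to topic k
    q = []
    for i, doc in enumerate(docs):        # random initial assignment
        q_i = []
        for v in doc:
            k = rng.randrange(K)
            q_i.append(k)
            c_ik[i][k] += 1
            c_vk[v][k] += 1
            c_k[k] += 1
        q.append(q_i)
    for _ in range(n_sweeps):
        for i, doc in enumerate(docs):
            for l, v in enumerate(doc):
                k = q[i][l]
                # Decrement counts for the current word (the "minus" counts).
                c_ik[i][k] -= 1
                c_vk[v][k] -= 1
                c_k[k] -= 1
                # Unnormalized full conditional; L_i + K*alpha does not depend
                # on k, so that factor is dropped.
                weights = [
                    (c_vk[v][j] + gamma) / (c_k[j] + V * gamma) * (c_ik[i][j] + alpha)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                q[i][l] = k
                c_ik[i][k] += 1
                c_vk[v][k] += 1
                c_k[k] += 1
    return q, c_ik, c_vk, c_k

docs = [[0, 1, 2, 0, 1], [2, 3, 4, 3, 4], [0, 1, 1, 2], [3, 4, 4, 2]]
q, c_ik, c_vk, c_k = collapsed_gibbs_lda(docs, K=2, V=5, alpha=0.5, gamma=0.1, n_sweeps=20)
```

After sampling, a point estimate of the topics can be read off the counts, for instance by normalizing $c_{vk} + \gamma$ over v for each k.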


Example 


This process is illustrated in Figure 27.8 on a small example with two topics, and five words. 
The left part of the figure illustrates 16 documents that were sampled from the LDA model using 
p(money|k = 1) = p(loan|k = 1) = p(bank|k = 1) = 1/3 and p(river|k = 2) = p(stream|k = 
2) = p(bank|k = 2) = 1/3. For example, we see that the first document contains the word 
“bank” 4 times (indicated by the four dots in row 1 of the “bank” column), as well as various 
other financial terms. The right part of the figure shows the state of the Gibbs sampler after 
64 iterations. The “correct” topic has been assigned to each token in most cases. For example, 
in document 1, we see that the word “bank” has been correctly assigned to the financial topic, 
based on the presence of the words “money” and “loan”. The posterior mean estimate of the 
parameters is given by p(money|k = 1) = 0.32, p(loan|k = 1) = 0.29, p(bank|k = 1) = 
0.39, p(river|k = 2) = 0.25, p(stream|k = 2) = 0.4, and p(bank|k = 2) = 0.35, which is 
impressively accurate, given that there are only 16 training examples. 


Figure 27.8 Illustration of (collapsed) Gibbs sampling applied to a small LDA example. There are N = 16 
documents, each containing a variable number of words drawn from a vocabulary of V = 5 words. There 
are two topics. A white dot means the word is assigned to topic 1, a black dot means the word is 
assigned to topic 2. (a) The initial random assignment of states. (b) A sample from the posterior after 64 
steps of Gibbs sampling. Source: Figure 7 of (Steyvers and Griffiths 2007). Used with kind permission of 
Tom Griffiths. 


Fitting using batch variational inference 


A faster alternative to MCMC is to use variational EM. (We cannot use exact EM since exact 
inference of $\pi_i$ and $q_i$ is intractable.) We give the details below. 


Sequence version 


Following (Blei et al. 2003), we will use a fully factorized (mean field) approximation of the form 
$$q(\pi_i, q_i) = \text{Dir}(\pi_i | \tilde{\pi}_i) \prod_l \text{Cat}(q_{il} | \tilde{q}_{il}) \tag{27.38}$$


We will follow the usual mean field recipe. For $q(q_{il})$, we use Bayes’ rule, but where we need to 
take expectations over the prior: 

$$\tilde{q}_{ilk} \propto b_{y_{il},k} \exp(\mathbb{E}[\log \pi_{ik}]) \tag{27.39}$$

where 

$$\mathbb{E}[\log \pi_{ik}] = \psi_k(\tilde{\pi}_{i\cdot}) \triangleq \Psi(\tilde{\pi}_{ik}) - \Psi\Big(\sum_{k'} \tilde{\pi}_{ik'}\Big) \tag{27.40}$$

where $\Psi$ is the digamma function. The update for $q(\pi_i)$ is obtained by adding up the expected 
counts: 

$$\tilde{\pi}_{ik} = \alpha_k + \sum_l \tilde{q}_{ilk} \tag{27.41}$$

The M step is obtained by adding up the expected counts and normalizing: 

$$\hat{b}_{vk} \propto \gamma_v + \sum_{i=1}^{N} \sum_{l=1}^{L_i} \tilde{q}_{ilk} \, \mathbb{I}(y_{il} = v) \tag{27.42}$$


Count version 


Note that the E step takes $O((\sum_i L_i) V K)$ space to store the $\tilde{q}_{ilk}$. It is much more space 
efficient to perform inference in the mPCA version of the model, which works with counts; these 
only take $O(NVK)$ space, which is a big savings if documents are long. (By contrast, the 
collapsed Gibbs sampler must work explicitly with the $q_{il}$ variables.) 

We will focus on approximating $p(\pi_i, c_i | n_i, L_i)$, where we write $c_i$ as shorthand for $c_{i\cdot\cdot}$. We 
will again use a fully factorized (mean field) approximation of the form 

$$q(\pi_i, c_i) = \text{Dir}(\pi_i | \tilde{\pi}_i) \prod_v \text{Mu}(c_{iv\cdot} | n_{iv}, \tilde{c}_{iv\cdot}) \tag{27.43}$$


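The new E step becomes 

$$\tilde{\pi}_{ik} = \alpha_k + \sum_v n_{iv} \tilde{c}_{ivk} \tag{27.44}$$

$$\tilde{c}_{ivk} \propto b_{vk} \exp(\mathbb{E}[\log \pi_{ik}]) \tag{27.45}$$

The new M step becomes 

$$\hat{b}_{vk} \propto \gamma_v + \sum_i n_{iv} \tilde{c}_{ivk} \tag{27.46}$$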


VB version 


We now modify the algorithm to use VB instead of EM, so that we infer the parameters as 
well as the latent variables. There are two advantages to this. First, by setting $\gamma < 1$, VB will 
encourage $B$ to be sparse (as in Section 21.6.1.6). Second, we will be able to generalize this to 
the online learning setting, as we discuss below. 

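Our new posterior approximation becomes 

$$q(\pi_i, c_i, B) = \text{Dir}(\pi_i | \tilde{\pi}_i) \prod_v \text{Mu}(c_{iv\cdot} | n_{iv}, \tilde{c}_{iv\cdot}) \prod_k \text{Dir}(b_{\cdot k} | \tilde{b}_{\cdot k}) \tag{27.47}$$

The update for $\tilde{c}_{ivk}$ changes to the following: 

$$\tilde{c}_{ivk} \propto \exp(\mathbb{E}[\log b_{vk}] + \mathbb{E}[\log \pi_{ik}]) \tag{27.48}$$

Also, the M step becomes 

$$\tilde{b}_{vk} = \gamma_v + \sum_i n_{iv} \tilde{c}_{ivk} \tag{27.49}$$

No normalization is required, since we are just updating the pseudo-counts. The overall algorithm 
is summarized in Algorithm 27.1. 

Algorithm 27.1: Batch VB for LDA 

Input: $n_{iv}$, $K$, $\alpha_k$, $\gamma_v$; 
Estimate $\tilde{b}_{vk}$ using EM for multinomial mixtures; 
Initialize counts $n_{iv}$; 
while not converged do 
    // E step 
    $s_{vk} = 0$ // expected sufficient statistics 
    for each document $i = 1 : N$ do 
        $(\tilde{\pi}_i, \tilde{c}_i) = \text{Estep}(n_i, \tilde{B}, \alpha)$; 
        $s_{vk} \mathrel{+}= n_{iv} \tilde{c}_{ivk}$; 
    // M step 
    for each topic $k = 1 : K$ do 
        $\tilde{b}_{vk} = \gamma_v + s_{vk}$; 

function $(\tilde{\pi}_i, \tilde{c}_i) = \text{Estep}(n_i, \tilde{B}, \alpha)$ 
    Initialize $\tilde{\pi}_{ik} = \alpha_k$; 
    repeat 
        $\tilde{\pi}_i^{old} = \tilde{\pi}_i$, $\tilde{\pi}_{ik} = \alpha_k$; 
        for each word $v = 1 : V$ do 
            for each topic $k = 1 : K$ do 
                $\tilde{c}_{ivk} = \exp(\psi_v(\tilde{b}_{\cdot k}) + \psi_k(\tilde{\pi}_i^{old}))$; 
            $\tilde{c}_{iv\cdot} = \text{normalize}(\tilde{c}_{iv\cdot})$; 
            $\tilde{\pi}_{ik} \mathrel{+}= n_{iv} \tilde{c}_{ivk}$; 
    until $\frac{1}{K} \sum_k |\tilde{\pi}_{ik} - \tilde{\pi}_{ik}^{old}| <$ thresh; 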
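As a rough illustration, the batch VB procedure summarized above can be sketched in plain Python. This is a sketch under stated assumptions, not a reference implementation: the pseudo-counts are initialized randomly rather than by EM for multinomial mixtures, symmetric hyper-parameters `alpha0` and `gamma0` are used, a fixed number of outer iterations replaces the convergence test, and `digamma` is a hand-written approximation (recurrence plus asymptotic series) so that the block needs only the standard library. All function and argument names are illustrative.

```python
import math
import random

def digamma(x):
    """Approximate digamma for x > 0: shift x above 6, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def e_step(n_i, b_tilde, alpha, thresh=1e-4, max_iter=100):
    """Mean field updates for one document; n_i[v] is the count of word v."""
    V, K = len(b_tilde), len(alpha)
    # E[log b_vk] = digamma(b_vk) - digamma(sum over v' of b_v'k)
    col = [digamma(sum(b_tilde[v][k] for v in range(V))) for k in range(K)]
    elog_b = [[digamma(b_tilde[v][k]) - col[k] for k in range(K)] for v in range(V)]
    pi = list(alpha)
    c = [[1.0 / K] * K for _ in range(V)]
    for _ in range(max_iter):
        pi_old = pi
        total = digamma(sum(pi_old))
        elog_pi = [digamma(p) - total for p in pi_old]
        pi = list(alpha)
        for v in range(V):
            if n_i[v] == 0:
                continue  # absent words contribute nothing
            w = [math.exp(elog_b[v][k] + elog_pi[k]) for k in range(K)]
            z = sum(w)
            c[v] = [x / z for x in w]
            for k in range(K):
                pi[k] += n_i[v] * c[v][k]
        if sum(abs(a - b) for a, b in zip(pi, pi_old)) / K < thresh:
            break
    return pi, c

def batch_vb_lda(n, K, alpha0=0.1, gamma0=0.1, n_iter=20, seed=0):
    """n[i][v] are word counts; returns pseudo-counts b_tilde[v][k] and the pi_tilde_i."""
    rng = random.Random(seed)
    V = len(n[0])
    alpha = [alpha0] * K
    # Assumption: random initialization instead of EM for multinomial mixtures.
    b_tilde = [[gamma0 + rng.random() for _ in range(K)] for _ in range(V)]
    pis = []
    for _ in range(n_iter):
        s = [[0.0] * K for _ in range(V)]  # expected sufficient statistics
        pis = []
        for n_i in n:
            pi, c = e_step(n_i, b_tilde, alpha)
            pis.append(pi)
            for v in range(V):
                for k in range(K):
                    s[v][k] += n_i[v] * c[v][k]
        # M step: pseudo-counts are not normalized.
        b_tilde = [[gamma0 + s[v][k] for k in range(K)] for v in range(V)]
    return b_tilde, pis
```

A quick sanity check is that the pseudo-counts absorb every token exactly once: the entries of `b_tilde` should sum to $VK\gamma$ plus the total number of tokens.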


Fitting using online variational inference 


In the batch version, the E step clearly takes $O(NKVT)$ time, where $T$ is the number of 
mean field updates (typically $T \sim 5$). This can be slow if we have many documents. This can 
be reduced by using stochastic gradient descent (Section 8.5.2) to perform online variational 
inference, as we now explain. 

We can derive an online version, following (Hoffman et al. 2010). We perform an E step in the 
usual way. We then compute the variational parameters for $B$ treating the expected sufficient 
statistics from the single data case as if the whole data set had those statistics. Finally, we make 
a partial update for the variational parameters for $B$, putting weight $\rho_t$ on the new estimate 
and weight $1 - \rho_t$ on the old estimate. The step size $\rho_t$ decays over time, as in Equation 8.83. 
The overall algorithm is summarized in Algorithm 27.2. In practice, we should use mini-batches, 
as explained in Section 8.5.2.3. In (Hoffman et al. 2010), they used a batch of size 256-4096. 

Algorithm 27.2: Online variational Bayes for LDA 

Input: $n_{iv}$, $K$, $\alpha_k$, $\gamma_v$, $\tau_0$, $\kappa$; 
Initialize $\tilde{b}_{vk}$ randomly; 
for $t = 1 : \infty$ do 
    Set step size $\rho_t = (\tau_0 + t)^{-\kappa}$; 
    Pick document $i = i(t)$; 
    $(\tilde{\pi}_i, \tilde{c}_i) = \text{Estep}(n_i, \tilde{B}, \alpha)$; 
    $\tilde{b}_{vk}^{new} = \gamma_v + N n_{iv} \tilde{c}_{ivk}$; 
    $\tilde{b}_{vk} = (1 - \rho_t) \tilde{b}_{vk} + \rho_t \tilde{b}_{vk}^{new}$; 

Figure 27.9 Test perplexity vs number of training documents for batch and online VB-LDA. From Figure 
1 of (Hoffman et al. 2010). Used with kind permission of David Blei. 

Figure 27.9 plots the perplexity on a test set of size 1000 vs number of analyzed documents (E 
steps), where the data is drawn from (English) Wikipedia. The figure shows that online variational 
inference is much faster than offline inference, yet produces similar results. 
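The two steps that distinguish the online algorithm from the batch one, the decaying step size and the blended update of the pseudo-counts, are small enough to sketch directly. The sketch below assumes the per-document responsibilities `c_i[v][k]` have already been produced by an E step; the function names and the default values of `tau0` and `kappa` are illustrative choices, not values taken from the text.

```python
def step_size(t, tau0=1.0, kappa=0.7):
    """Decaying step size rho_t = (tau0 + t) ** (-kappa)."""
    return (tau0 + t) ** (-kappa)

def online_update(b_tilde, n_i, c_i, gamma, N, rho):
    """Blend the old pseudo-counts with the estimate implied by a single document.

    The document's expected counts are scaled by N, as if the whole corpus
    consisted of N copies of it, and then mixed in with weight rho.
    """
    V, K = len(b_tilde), len(b_tilde[0])
    return [
        [
            (1.0 - rho) * b_tilde[v][k] + rho * (gamma + N * n_i[v] * c_i[v][k])
            for k in range(K)
        ]
        for v in range(V)
    ]
```

With `rho = 0` the old estimate is kept unchanged, and with `rho = 1` it is replaced entirely by the single-document estimate; the decaying schedule moves gradually from the second regime towards the first.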


Determining the number of topics 


Choosing K, the number of topics, is a standard model selection problem. Here are some 
approaches that have been taken: 


• Use annealed importance sampling (Section 24.6.2) to approximate the evidence (Wallach 
et al. 2009). 
• Cross validation, using the log likelihood on a test set. 
• Use the variational lower bound as a proxy for $\log p(D|K)$. 
• Use non-parametric Bayesian methods (Teh et al. 2006). 
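For the cross validation option, a common way to report held-out log likelihood is as perplexity, the quantity plotted in Figure 27.9. The following sketch computes a plug-in perplexity, assuming point estimates of the topic-word probabilities `B[v][k]` and of each held-out document's topic weights `pis[i]` are already available (for example, posterior means). It is a simplification: it does not implement annealed importance sampling or any other careful evidence estimator, and the function name is illustrative.

```python
import math

def perplexity(docs, B, pis):
    """Plug-in perplexity: exp of minus the average per-token log likelihood.

    docs[i] is a list of word ids, B[v][k] = p(word v | topic k),
    and pis[i][k] is the weight of topic k in document i.
    """
    log_lik, n_tokens = 0.0, 0
    for doc, pi in zip(docs, pis):
        for v in doc:
            p = sum(B[v][k] * pi[k] for k in range(len(pi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)
```

One would fit a model for each candidate $K$ on training documents and prefer the value with the lowest held-out perplexity. A model that is uniform over a vocabulary of size $V$ has perplexity exactly $V$, which gives a useful baseline.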


Extensions of LDA 


Many extensions of LDA have been proposed since the first paper came out in 2003. We briefly 
discuss a few of these below. 


Correlated topic model 


One weakness of LDA is that it cannot capture correlation between topics. For example, if a 
document has the “business” topic, it is reasonable to expect the “finance” topic to co-occur. 
The source of the problem is the use of a Dirichlet prior for $\pi_i$. The problem with the Dirichlet 
is that it is characterized by just a mean vector and a strength parameter, but its covariance is 
fixed ($\Sigma_{ij} \propto -\alpha_i \alpha_j$ for $i \neq j$), rather than being a free parameter. 

One way around this is to replace the Dirichlet prior with the logistic normal distribution, as 
in categorical PCA (Section 27.2.2). The model becomes 


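$$b_k | \gamma \sim \text{Dir}(\gamma \mathbf{1}_V) \tag{27.50}$$

$$z_i \sim \mathcal{N}(\mu, \Sigma) \tag{27.51}$$

$$\pi_i | z_i = S(z_i) \tag{27.52}$$

$$q_{il} | \pi_i \sim \text{Cat}(\pi_i) \tag{27.53}$$

$$y_{il} | q_{il} = k, B \sim \text{Cat}(b_k) \tag{27.54}$$

This is known as the correlated topic model (Blei and Lafferty 2007). This is very similar to 
categorical PCA, but slightly different. To see the difference, let us marginalize out the $q_{il}$ and 
$\pi_i$. Then in the CTM we have 

$$y_{il} \sim \text{Cat}(B S(z_i)) \tag{27.55}$$

where $B$ is a stochastic matrix. By contrast, in catPCA we have 

$$y_{il} \sim \text{Cat}(S(W z_i)) \tag{27.56}$$

where $W$ is an unconstrained matrix. 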
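To make the generative story of the correlated topic model concrete, here is a small sampling sketch for a single document. It assumes the caller supplies a lower-triangular Cholesky factor `chol` of the covariance $\Sigma$ (so that $z = \mu + \text{chol} \cdot \epsilon$ has the required distribution) and the topic-word distributions as rows `B[k]`; these argument names and the function names are illustrative.

```python
import math
import random

def softmax(z):
    """Numerically stable softmax S(z)."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def sample_ctm_document(mu, chol, B, length, rng):
    """Draw (pi, words) for one document from the correlated topic model.

    mu: mean of z (length K); chol: lower-triangular Cholesky factor of Sigma;
    B[k]: distribution over the vocabulary for topic k.
    """
    K = len(mu)
    eps = [rng.gauss(0.0, 1.0) for _ in range(K)]
    # z ~ N(mu, Sigma) via z = mu + chol @ eps
    z = [mu[k] + sum(chol[k][j] * eps[j] for j in range(k + 1)) for k in range(K)]
    pi = softmax(z)  # logistic normal topic proportions
    words = []
    for _ in range(length):
        k = rng.choices(range(K), weights=pi)[0]           # topic for this token
        v = rng.choices(range(len(B[k])), weights=B[k])[0]  # word from that topic
        words.append(v)
    return pi, words
```

Off-diagonal entries in the covariance are what allow two topics to rise and fall together across documents, which a Dirichlet prior cannot express.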

Fitting this model is tricky, since the prior for $\pi_i$ is no longer conjugate to the multinomial 
likelihood for $q_{il}$. However, we can use any of the variational methods in Section 21.8.1.1, where 
we discussed Bayesian multiclass logistic regression. In the CTM case, things are even harder 
since the categorical response variables $q_i$ are hidden, but we can handle this by using an 
additional mean field approximation. See (Blei and Lafferty 2007) for details. 


Having fit the model, one can then convert $\hat{\Sigma}$ to a sparse precision matrix $\hat{\Sigma}^{-1}$ by pruning 
low-strength edges, to get a sparse Gaussian graphical model. This allows you to visualize the 
correlation between topics. Figure 27.10 shows the result of applying this procedure to articles 
from Science magazine, from 1990-1999. (This corpus contains 16,351 documents, and 5.7M words 
(19,088 of them unique), after stop-word and low-frequency removal.) Nodes represent topics, 
with the top 5 words per topic listed inside. The font size reflects the overall prevalence of the 
topic in the corpus. Edges represent significant elements of the precision matrix. 
Figure 27.10 Output of the correlated topic model (with K = 50 topics) when applied to articles from 
Science. Nodes represent topics, with the 5 most probable phrases from each topic shown inside. Font 
size reflects overall prevalence of the topic. See http://www.cs.cmu.edu/~lemur/science/ for an 
interactive version of this model with 100 topics. Source: Figure 2 of (Blei and Lafferty 2007). Used with 
kind permission of David Blei. 
27.4.2 Dynamic topic model 


In LDA, the topics (distributions over words) are assumed to be static. In some cases, it makes 
sense to allow these distributions to evolve smoothly over time. For example, an article might 
use the topic “neuroscience”, but if it was written in the 1900s, it is more likely to use words 
like “nerve”, whereas if it was written in the 2000s, it is more likely to use words like “calcium 
receptor” (this reflects the general trend of neuroscience towards molecular biology). 

One way to model this is to use a dynamic logistic normal model, as illustrated in Figure 27.11. 
In particular, we assume the topic distributions evolve according to a Gaussian random walk, 
and then we map these Gaussian vectors to probabilities via the softmax function: 

$$b_{t,k} | b_{t-1,k} \sim \mathcal{N}(b_{t-1,k}, \sigma^2 \mathbf{I}_V) \tag{27.57}$$

$$\pi_i^t \sim \text{Dir}(\alpha \mathbf{1}_K) \tag{27.58}$$

$$q_{il}^t | \pi_i^t \sim \text{Cat}(\pi_i^t) \tag{27.59}$$

$$y_{il}^t | q_{il}^t = k, B \sim \text{Cat}(S(b_{t,k})) \tag{27.60}$$


This is known as a dynamic topic model (Blei and Lafferty 2006b). 
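The random walk in Equation 27.57 followed by the softmax in Equation 27.60 can be simulated for a single topic as follows. This is only a sketch of the prior over topic trajectories, with illustrative names; it does not perform inference.

```python
import math
import random

def softmax(z):
    """Numerically stable softmax S(z)."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def evolve_topic(b0, sigma, T, rng):
    """Simulate one topic over T time slices.

    b0: initial natural parameters (length V); sigma: random walk standard deviation.
    Returns a list of T word distributions, one per time slice.
    """
    path = [list(b0)]
    for _ in range(T - 1):
        # Independent Gaussian step for each vocabulary entry.
        path.append([b + rng.gauss(0.0, sigma) for b in path[-1]])
    return [softmax(b) for b in path]
```

Small values of `sigma` give topics that drift slowly, so that adjacent time slices share most of their high-probability words, while `sigma = 0` recovers a static topic.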
Figure 27.11 The dynamic topic model. 


One can perform approximate inference in this model using a structured mean field method 
(Section 21.4) that exploits the Kalman smoothing algorithm (Section 18.3.1) to perform exact 
inference on the linear-Gaussian chain between the $b_{t,k}$ nodes (see (Blei and Lafferty 2006b) for 
details). 

Figure 27.12 illustrates a typical output of the system when applied to 100 years of articles 
from Science. On the top, we visualize the top 10 words from a specific topic (which seems to 
be related to neuroscience) after 10 year intervals. On the bottom left, we plot the probability 
of some specific words belonging to this topic. On the bottom right, we list the titles of some 
articles that contained this topic. 

One interesting application of this model is to perform temporally-corrected document 
retrieval. That is, suppose we look for documents about the inheritance of disease. Modern 
articles will use words like “DNA”, but older articles (before the discovery of DNA) may use other 
terms such as “heritable unit”. But both articles are likely to use the same topics. Similar ideas 
can be used to perform cross-language information retrieval, see e.g., (Cimiano et al. 2009). 


LDA-HMM 


The LDA model assumes words are exchangeable, which is clearly not true. A simple way 
to model sequential dependence between words is to use a hidden Markov model or HMM. 
The trouble with HMMs is that they can only model short-range dependencies, so they cannot 
capture the overall gist of a document. Hence they can generate syntactically correct sentences 
(see e.g., Table 17.1), but not semantically plausible ones. 

It is possible to combine LDA with HMM to create a model called LDA-HMM (Griffiths et al. 
2004). This model uses the HMM states to model function or syntactic words, such as “and” or 
“however”, and uses the LDA to model content or semantic words, which are harder to predict. 
There is a distinguished HMM state which specifies when the LDA model should be used to 
generate the word; the rest of the time, the HMM generates the word. 

Figure 27.12 Part of the output of the dynamic topic model when applied to articles from Science. We 
show the top 10 words for the neuroscience topic over time. We also show the probability of three words 
within this topic over time, and some articles that contained this topic. Source: Figure 4 of (Blei and 
Lafferty 2006b). Used with kind permission of David Blei. 

More formally, for each document $i$, the model defines an HMM with states $z_{il} \in \{0, \ldots, C\}$. 
In addition, each document has an LDA model associated with it. If $z_{il} = 0$, we generate word 
$y_{il}$ from the semantic LDA model, with topic specified by $q_{il}$; otherwise we generate word $y_{il}$ 
from the syntactic HMM model. The DGM is shown in Figure 27.13. The CPDs are as follows: 

$$p(\pi_i) = \text{Dir}(\pi_i | \alpha \mathbf{1}_K) \tag{27.61}$$

$$p(q_{il} = k | \pi_i) = \pi_{ik} \tag{27.62}$$

$$p(z_{il} = c' | z_{i,l-1} = c) = A^{HMM}(c, c') \tag{27.63}$$

$$p(y_{il} = v | q_{il} = k, z_{il} = c) = \begin{cases} B^{LDA}(k, v) & \text{if } c = 0 \\ B^{HMM}(c, v) & \text{if } c > 0 \end{cases} \tag{27.64}$$

where $B^{LDA}$ is the usual topic-word matrix, $B^{HMM}$ is the state-word HMM emission matrix 
and $A^{HMM}$ is the state-state HMM transition matrix. 

Inference in this model can be done with collapsed Gibbs sampling, analytically integrating 
out all the continuous quantities. See (Griffiths et al. 2004) for the details. 
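The switching behaviour defined by Equations 27.61 to 27.64 is easy to simulate. The sketch below generates one document; it assumes a fixed initial HMM state `z0` (the text does not specify an initial state distribution), stores `B_hmm` with one row per state so that row 0 is simply unused, and uses illustrative argument names throughout.

```python
import random

def sample_lda_hmm_document(alpha, A_hmm, B_lda, B_hmm, length, rng, z0=0):
    """Generate (words, states) for one document from the LDA-HMM.

    A_hmm[c][c2]: transition probabilities over states 0..C.
    B_lda[k]: word distribution of LDA topic k (used when the state is 0).
    B_hmm[c]: word distribution of HMM state c > 0 (row 0 is ignored).
    z0: assumed initial state preceding the first word.
    """
    K = len(B_lda)
    # pi ~ Dir(alpha * 1_K), sampled by normalizing gamma draws.
    g = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    total = sum(g)
    pi = [x / total for x in g]
    z, words, states = z0, [], []
    for _ in range(length):
        z = rng.choices(range(len(A_hmm)), weights=A_hmm[z])[0]
        if z == 0:
            # Semantic word: pick a topic, then emit from that topic.
            k = rng.choices(range(K), weights=pi)[0]
            probs = B_lda[k]
        else:
            # Syntactic word: emit directly from the HMM state.
            probs = B_hmm[z]
        words.append(rng.choices(range(len(probs)), weights=probs)[0])
        states.append(z)
    return words, states
```

In a generated document, the positions with state 0 carry the document's topical content, while all other positions are filled by the document-independent HMM.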

The results of applying this model (with $K = 200$ LDA topics and $C = 20$ HMM states) to the 
combined Brown and TASA corpora⁴ are shown in Table 27.1. We see that the HMM generally is 
responsible for syntactic words, and the LDA for semantic words. If we did not have the HMM, 
the LDA topics would get “polluted” by function words (see the upper row of Table 27.1), which is 
why such words are normally removed during preprocessing. 

The model can also help disambiguate when the same word is being used syntactically or 
semantically. Figure 27.14 shows some examples when the model was applied to the NIPS 
corpus.⁵ We see that the roles of words are distinguished, e.g., “we require the algorithm to 
return a matrix” (verb) vs “the maximal expected return” (noun). In principle, a part of speech 
tagger could disambiguate these two uses, but note that (1) the LDA-HMM method is fully 
unsupervised (no POS tags were used), and (2) sometimes a word can have the same POS tag, 
but different senses, e.g., “the left graph” (a syntactic role) vs “the graph G” (a semantic role). 

The topic of probabilistic models for syntax and semantics is a vast one, which we do not 
have space to delve into any more. See e.g., (Jurafsky and Martin 2008) for further information. 

4. The Brown corpus consists of 500 documents and 1,137,466 word tokens, with part-of-speech tags for each token. 
The TASA corpus is an untagged collection of educational materials consisting of 37,651 documents and 12,190,931 word 
tokens. Words appearing in fewer than 5 documents were replaced with an asterisk, but punctuation was included. The 
combined vocabulary was of size 37,202 unique words. 

5. NIPS stands for “Neural Information Processing Systems”. It is one of the top machine learning conferences. The 
NIPS corpus volumes 1-12 contain 1713 documents. 

Figure 27.13 LDA-HMM model. 

Figure 27.14 Function and content words in the NIPS corpus, as distinguished by the LDA-HMM model. 
Graylevel indicates posterior probability of assignment to LDA component, with black being highest. The 
boxed word appears as a function word in one sentence, and as a content word in another sentence. 
Asterisked words had low frequency, and were treated as a single word type by the model. Source: Figure 
4 of (Griffiths et al. 2004). Used with kind permission of Tom Griffiths. 

Table 27.1 Upper row: Topics extracted by the LDA model when trained on the combined Brown and 
TASA corpora. Middle row: topics extracted by LDA part of LDA-HMM model. Bottom row: topics extracted 
by HMM part of LDA-HMM model. Each column represents a single topic/class, and words appear in order 
of probability in that topic/class. Since some classes give almost all probability to only a few words, a list 
is terminated when the words account for 90% of the probability mass. Source: Figure 2 of (Griffiths et al. 
2004). Used with kind permission of Tom Griffiths. 

Figure 27.15 (a) Supervised LDA. (b) Discriminative LDA. 


Supervised LDA 


In this section, we discuss extensions of LDA to handle side information of various kinds beyond 
just words. 


Generative supervised LDA 


Suppose we have a variable length sequence of words $y_{il} \in \{1, \ldots, V\}$ as usual, but we also 
have a class label $c_i \in \{1, \ldots, C\}$. How can we predict $c_i$ from $y_i$? There are many possible 
approaches, but most are direct mappings from the words to the class. In some cases, such 
as sentiment analysis, we can get better performance by first performing inference, to try 
to disambiguate the meaning of words. For example, suppose the goal is to determine if a 
document is a favorable review of a movie or not. If we encounter the phrase “Brad Pitt was 
excellent until the middle of the movie”, the word “excellent” may lead us to think the review is 
positive, but clearly the overall sentiment is negative. 

One way to tackle such problems is to build a joint model of the form $p(c_i, y_i | \theta)$. (Blei 
and McAuliffe 2010) proposes an approach, called supervised LDA, where the class label $c_i$ is 
generated from the topics as follows: 


$$p(c_i | \bar{q}_i) = \text{Ber}(\text{sigm}(w^T \bar{q}_i)) \tag{27.65}$$

Here $\bar{q}_i$ is the empirical topic distribution for document $i$: 

$$\bar{q}_{ik} \triangleq \frac{1}{L_i} \sum_{l=1}^{L_i} q_{ilk} \tag{27.66}$$

See Figure 27.15(a) for an illustration. 

Figure 27.16 Discriminative variants of LDA. (a) Mixture of experts aka MR-LDA. The double ring denotes 
a node $\pi_i$ that is a deterministic function of its parents. (b) Mixture of experts with random effects. (c) 
DMR-LDA. 

We can fit this model using Monte Carlo EM: run the collapsed Gibbs sampler in the E step, to 
compute $\mathbb{E}[\bar{q}_{ik}]$, and then use this as the input feature to a standard logistic regression package. 
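Equations 27.65 and 27.66 amount to a logistic regression on topic frequencies. A minimal sketch, assuming the topic assignments for a document are given as a list of topic indices (for example, one sample from the Gibbs sampler) and that the weight vector `w` has already been fit; the function names are illustrative.

```python
import math

def empirical_topic_distribution(q_i, K):
    """Fraction of tokens in a document assigned to each of the K topics."""
    counts = [0] * K
    for k in q_i:
        counts[k] += 1
    return [c / len(q_i) for c in counts]

def slda_label_probability(q_bar, w):
    """p(c_i = 1 | q_bar) = sigm(w^T q_bar) for a binary label."""
    a = sum(w_k * q_k for w_k, q_k in zip(w, q_bar))
    return 1.0 / (1.0 + math.exp(-a))
```

In the Monte Carlo EM scheme described above, `q_bar` would be replaced by its average over several Gibbs samples before being passed to the logistic regression.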


Discriminative supervised LDA 


An alternative approach, known as discriminative LDA (Lacoste-Julien et al. 2009), is shown in 
Figure 27.15(b). This is a discriminative model of the form $p(y_i | c_i, \theta)$. The only change from 
regular LDA is that the topic prior becomes input dependent, as follows: 

$$p(q_{il} | \pi_i, c_i = c, \theta) = \text{Cat}(A_c \pi_i) \tag{27.67}$$

where $A_c$ is a $K \times K$ stochastic matrix. 

So far, we have assumed the “side information” is a single categorical variable $c_i$. Often we 
have high dimensional covariates $x_i \in \mathbb{R}^D$. For example, consider the task of image tagging. 
The idea is that $y_{il}$ represent correlated tags or labels, which we want to predict given $x_i$. We 
now discuss several attempts to extend LDA so that it can generate tags given the inputs. 

The simplest approach is to use a mixture of experts (Section 11.2.4) with multiple outputs. 
This is just like LDA except we replace the Dirichlet prior on $\pi_i$ with a deterministic function 
of the input: 

$$\pi_i = S(W x_i) \tag{27.68}$$

In (Law et al. 2010), this is called multinomial regression LDA. See Figure 27.16(a). Eliminating 
the deterministic $\pi_i$ we have 

$$p(q_{il} | x_i, W) = \text{Cat}(S(W x_i)) \tag{27.69}$$

We can fit this with EM in the usual way. However, (Law et al. 2010) suggest an alternative. 
First fit an unsupervised LDA model based only on $y_i$; then treat the inferred $\pi_i$ as data, and 
fit a multinomial logistic regression model mapping $x_i$ to $\pi_i$. Although this is fast, fitting LDA 
in an unsupervised fashion does not necessarily result in a discriminative set of latent variables, 
as discussed in (Blei and McAuliffe 2010). 

There is a more subtle problem with this model. Since $\pi_i$ is a deterministic function of the 
inputs, it is effectively observed, rendering the $q_{il}$ (and hence the tags $y_{il}$) independent. In other 
words, 

$$p(y_i | x_i, \theta) = \prod_{l=1}^{L_i} p(y_{il} | x_i, \theta) = \prod_{l=1}^{L_i} \sum_k p(y_{il} | q_{il} = k, B) \, p(q_{il} = k | x_i, W) \tag{27.70}$$

This means that if we observe the value of one tag, it will have no influence on any of the 
others. This may explain why the results in (Law et al. 2010) only show negligible improvement 
over predicting each tag independently. 

One way to induce correlations is to make $W$ a random variable. The resulting model is 
shown in Figure 27.16(b). We call this a random effects mixture of experts. We typically 
assume a Gaussian prior on $W_i$. If $x_i = 1$, then $p(q_{il} | x_i, w_i) = \text{Cat}(S(w_i))$, so we recover 
the correlated topic model. It is possible to extend this model by adding Markovian dynamics 
to the $q_{il}$ variables. This is called a conditional topic random field (Zhu and Xing 2010). 

A closely related approach, known as Dirichlet multinomial regression LDA (Mimno and 
McCallum 2008), is shown in Figure 27.16(c). This is identical to standard LDA except we make 
$\alpha$ a function of the input 

$$\alpha_i = \exp(W x_i) \tag{27.71}$$

where $W$ is a $K \times D$ matrix. Eliminating the deterministic $\alpha_i$ we have 

$$\pi_i \sim \text{Dir}(\exp(W x_i)) \tag{27.72}$$

Unlike (Law et al. 2010), this model allows information to flow between tags via the latent $\pi_i$. 

A variant of this model, where $x_i$ corresponds to a bag of discrete labels and $\pi_i \sim \text{Dir}(\alpha \odot 
x_i)$, is known as labeled LDA (Ramage et al. 2009). In this case, the labels $x_i$ are in 1:1 
correspondence with the latent topics, which makes the resulting topics much more interpretable. 
An extension, known as partially labeled LDA (Ramage et al. 2011), allows each label to have 
multiple latent sub-topics; this model includes LDA, labeled LDA and a multinomial mixture 
model as special cases. 
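The input-dependent priors of Equations 27.71 and 27.72, and the masked prior used by labeled LDA, can be sketched as follows. Dirichlet samples are drawn by normalizing gamma variates; treating a zero concentration as a topic that is switched off entirely is an assumption made here to represent the labeled LDA mask. All function names are illustrative.

```python
import math
import random

def dmr_alpha(W, x):
    """alpha_i = exp(W x_i), computed elementwise; W has one row per topic."""
    return [math.exp(sum(w_kd * x_d for w_kd, x_d in zip(row, x))) for row in W]

def labeled_lda_alpha(alpha, x):
    """alpha masked by a binary label vector x (one entry per topic)."""
    return [a * x_k for a, x_k in zip(alpha, x)]

def sample_dirichlet(alpha, rng):
    """Draw pi ~ Dir(alpha); entries with alpha = 0 are fixed at zero (assumption)."""
    g = [rng.gammavariate(a, 1.0) if a > 0 else 0.0 for a in alpha]
    total = sum(g)
    return [x / total for x in g]
```

For example, `sample_dirichlet(dmr_alpha(W, x), rng)` draws topic proportions whose prior depends on the covariates, while `sample_dirichlet(labeled_lda_alpha(alpha, x), rng)` restricts a document to the topics named by its labels.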


Discriminative categorical PCA 


An alternative to using LDA is to expand the categorical PCA model with inputs, as shown in 
Figure 27.17(a). Since the latent space is now real-valued, we can use simple linear regression 
for the input-hidden mapping. For the hidden-output mapping, we use traditional catPCA: 

$$p(y_i | z_i, W) = \prod_l \text{Cat}(y_{il} | S(W z_i)) \tag{27.74}$$

This model is essentially a probabilistic neural network with one hidden layer, as shown in 
Figure 27.17(b), but with exchangeable output (e.g., to handle variable numbers of tags). The 
key difference from a neural net is that information can flow between the $y_{il}$'s via the latent 
bottleneck layer $z_i$. This should work better than a conventional neural net when the output 
labels are highly correlated, even after conditioning on the features; this problem frequently 
arises in multi label classification. Note that we could allow a direct $x_i$ to $y_i$ arc, but this would 
require too many parameters if the number of labels is large.⁶ 

Figure 27.17 (a) Categorical PCA with inputs and exchangeable outputs. (b) Same as (a), but with the 
vector nodes expanded out. 

We can fit this model with a small modification of the variational EM algorithm in Section 12.4. 
If we use this model for regression, rather than classification, we can perform the E step exactly, 
by modifying the EM algorithm for factor analysis. (Ma et al. 1997) reports that this method 
converges faster than standard backpropagation. 

We can also extend the model so that the prior on z; is a mixture of Gaussians using input- 
dependent means. If the output is Gaussian, this corresponds to a mixture of discriminative 
factor analysers (Fokoue 2005; Zhou and Liu 2008). If the output is categorical, this would be 
an (as yet unpublished) model, which we could call “discriminative mixtures of categorical factor 
analyzers”. 


LVMs for graph-structured data 


Another source of discrete data is when modeling graph or network structures. To see the 
connection, recall that any graph on $D$ nodes can be represented as a $D \times D$ adjacency 
matrix $G$, where $G(i, j) = 1$ iff there is an edge from node $i$ to node $j$. Such matrices are 
binary, and often very sparse. See Figure 27.19 for an example. 

Graphs arise in many application areas, such as modeling social networks, protein-protein 
interaction networks, or patterns of disease transmission between people or animals. There are 
usually two primary goals when analysing such data: first, try to discover some “interesting 
structure” in the graph, such as clusters or communities; second, try to predict which links 
might occur in the future (e.g., who will make friends with whom). Below we summarize some 
models that have been proposed for these tasks, some of which are related to LDA. Further details 
on these and other approaches can be found in e.g., (Goldenberg et al. 2009) and the references 
therein. 

6. A non-probabilistic version of this idea, using squared loss, was proposed in (Ji et al. 2010). This is similar to a linear 
feed-forward neural network with an additional edge from $x_i$ directly to $y_i$. 

Figure 27.18 (a) A directed graph. (b) The same graph, with the nodes partitioned into 3 groups, making 
the block structure more apparent. 

Figure 27.19 (a) Adjacency matrix for the graph in Figure 27.18(a). (b) Rows and columns are shown 
permuted to show the block structure. We also sketch how the stochastic block model can generate this 
graph. From Figure 1 of (Kemp et al. 2006). Used with kind permission of Charles Kemp. 


Stochastic block model 


In Figure 27.18(a) we show a directed graph on 9 nodes. There is no apparent structure. However, 
if we look more deeply, we see it is possible to partition the nodes into three groups or blocks, 
$B_1 = \{1, 4, 6\}$, $B_2 = \{2, 3, 5, 8\}$, and $B_3 = \{7, 9\}$, such that most of the connections go from 
nodes in $B_1$ to $B_2$, or from $B_2$ to $B_3$, or from $B_3$ to $B_1$. This is illustrated in Figure 27.18(b). 

The problem is easier to understand if we plot the adjacency matrices. Figure 27.19(a) shows 
the matrix for the graph with the nodes in their original ordering. Figure 27.19(b) shows the 
matrix for the graph with the nodes in their permuted ordering. It is clear that there is block 
structure. 

Figure 27.20 Some examples of graphs generated using the stochastic block model with different kinds 
of connectivity patterns between the blocks. The abstract graph (between blocks) represents a ring, a 
dominance hierarchy, a common-cause structure, and a common-effect structure. From Figure 4 of (Kemp 
et al. 2010). Used with kind permission of Charles Kemp. 

We can make a generative model of block structured graphs as follows. First, for every 
node, sample a latent block $q_i \sim \text{Cat}(\pi)$, where $\pi_k$ is the probability of choosing block $k$, for 
$k = 1 : K$. Second, choose the probability of connecting group $a$ to group $b$, for all pairs of 
groups; let us denote this probability by $\eta_{a,b}$. This can come from a beta prior. Finally, generate 
each edge $R_{ij}$ using the following model: 

$$p(R_{ij} = r | q_i = a, q_j = b, \eta) = \text{Ber}(r | \eta_{a,b}) \tag{27.75}$$

This is called the stochastic block model (Nowicki and Snijders 2001). Figure 27.21(a) illustrates 
the model as a DGM, and Figure 27.19(c) illustrates how this model can be used to cluster the 
nodes in our example. 
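The three generative steps above translate directly into code. In this sketch the block-to-block connection probabilities `eta` are passed in rather than drawn from a beta prior, and self-loops are generated like any other edge; both are simplifying assumptions, and the function name is illustrative.

```python
import random

def sample_sbm(D, pi, eta, rng):
    """Sample block assignments q and a D x D adjacency matrix R.

    pi[k]: probability of block k; eta[a][b]: probability of an edge from a
    node in block a to a node in block b.
    """
    K = len(pi)
    q = [rng.choices(range(K), weights=pi)[0] for _ in range(D)]
    R = [
        [1 if rng.random() < eta[q[i]][q[j]] else 0 for j in range(D)]
        for i in range(D)
    ]
    return q, R
```

Setting `eta = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]` produces a ring over three blocks, like the one in the running example, in which nodes of the same block share connection patterns without being connected to each other.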

Note that this is quite different from a conventional clustering problem. For example, we 
see that all the nodes in block 3 are grouped together, even though there are no connections 
between them. What they share is the property that they “like to” connect to nodes in block 1, 
and to receive connections from nodes in block 2. Figure 27.20 illustrates the power of the model 
for generating many different kinds of graph structure. For example, some social networks have 
hierarchical structure, which can be modeled by clustering people into different social strata, 
whereas others consist of a set of cliques. 

Unlike a standard mixture model, it is not possible to fit this model using exact EM, because 
all the latent $q_i$ variables become correlated. However, one can use variational EM (Airoldi et al. 
2008), collapsed Gibbs sampling (Kemp et al. 2006), etc. We omit the details (which are similar 
to the LDA case). 

In (Kemp et al. 2006), they lifted the restriction that the number of blocks $K$ be fixed, by 
replacing the Dirichlet prior on $\pi$ by a Dirichlet process (see Section 25.2.2). This is known as 
the infinite relational model. See Section 27.6.1 for details. 

If we have features associated with each node, we can make a discriminative version of this 
model, for example by defining 

$$p(R_{ij} = r | q_i = a, q_j = b, x_i, x_j, \theta) = \text{Ber}(r | w_{a,b}^T f(x_i, x_j)) \tag{27.76}$$

where $f(x_i, x_j)$ is some way of combining the feature vectors. For example, we could use 
concatenation, $[x_i, x_j]$, or elementwise product $x_i \otimes x_j$ as in supervised LDA. The overall 
model is like a relational extension of the mixture of experts model. 

Figure 27.21 (a) Stochastic block model. (b) Mixed membership stochastic block model. 


Mixed membership stochastic block model 


In (Airoldi et al. 2008), they lifted the restriction that each node only belong to one cluster. That 
is, they replaced $q_i \in \{1, \ldots, K\}$ with $\pi_i \in S_K$. This is known as the mixed membership 
stochastic block model, and is similar in spirit to fuzzy clustering or soft clustering. Note 
that $\pi_{ik}$ is not the same as $p(z_i = k | D)$; the former represents ontological uncertainty (to 
what degree does each object belong to a cluster) whereas the latter represents epistemological 
uncertainty (which cluster does an object belong to). If we want to combine epistemological 
and ontological uncertainty, we can compute $p(\pi_i | D)$. 

In more detail, the generative process is as follows. First, each node picks a distribution over 
blocks, $\pi_i \sim \text{Dir}(\alpha)$. Second, choose the probability of connecting group $a$ to group $b$, for all 
pairs of groups, $\eta_{a,b} \sim \text{Beta}(\alpha, \beta)$. Third, for each edge, sample two discrete variables, one for 
each direction: 

$$q_{i \to j} \sim \text{Cat}(\pi_i), \quad q_{i \leftarrow j} \sim \text{Cat}(\pi_j) \tag{27.77}$$

Finally, generate each edge $R_{ij}$ using the following model: 

$$p(R_{ij} = 1 | q_{i \to j} = a, q_{i \leftarrow j} = b, \eta) = \eta_{a,b} \tag{27.78}$$

See Figure 27.21(b) for the DGM. 

Figure 27.22 (a) Who-likes-whom graph for Sampson’s monks. (b) Mixed membership of each monk in 
one of three groups. From Figures 2-3 of (Airoldi et al. 2008). Used with kind permission of Edo Airoldi. 
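The mixed membership generative process can be sketched in the same style as the ordinary stochastic block model; the difference is that a pair of roles is drawn afresh for every potential edge. As an assumption for brevity, `eta` is supplied by the caller rather than drawn from its beta prior, and the per-edge roles are returned in a dictionary keyed by `(i, j)` so they can be inspected. All names are illustrative.

```python
import random

def sample_mmsb(D, alpha, eta, rng):
    """Sample memberships, per-edge roles, and a D x D adjacency matrix.

    alpha: Dirichlet concentration vector (length K);
    eta[a][b]: probability of an edge when the sender acts in role a
    and the receiver acts in role b.
    """
    K = len(alpha)
    pis = []
    for _ in range(D):
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(g)
        pis.append([x / total for x in g])  # pi_i ~ Dir(alpha)
    roles = {}
    R = [[0] * D for _ in range(D)]
    for i in range(D):
        for j in range(D):
            a = rng.choices(range(K), weights=pis[i])[0]  # role of sender i towards j
            b = rng.choices(range(K), weights=pis[j])[0]  # role of receiver j towards i
            roles[(i, j)] = (a, b)
            R[i][j] = 1 if rng.random() < eta[a][b] else 0
    return pis, roles, R
```

Because roles are resampled per edge, a node whose `pi_i` puts weight on two blocks can behave like a member of either one, depending on which node it is interacting with.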

Unlike the regular stochastic block model, each node can play a different role, depending on 
who it is connecting to. As an illustration of this, we will consider a data set that is widely used 
in the social networks analysis literature. The data concerns who-likes-whom amongst a group 
of 18 monks. It was collected by hand in 1968 by Sampson (Sampson 1968) over a period of 
months. (These days, in the era of social media such as Facebook, a social network with only 18 
people is trivially small, but the methods we are discussing can be made to scale.) Figure 27.22(a) 
plots the raw data, and Figure 27.22(b) plots $\mathbb{E}[\pi_i]$ for each monk, where $K = 3$. We see that 
most of the monks belong to one of the three clusters, known as the “young turks”, the “outcasts” 
and the “loyal opposition”. However, some individuals, notably monk 15, belong to two clusters; 
Sampson called these monks the “waverers”. It is interesting to see that the model can recover 
the same kinds of insights as Sampson derived by hand. 

One prevalent problem in social network analysis is missing data. For example, if $R_{ij} = 0$, 
it may be due to the fact that persons $i$ and $j$ have not had an opportunity to interact, or 
that data is not available for that interaction, as opposed to the fact that these people don't 
want to interact. In other words, absence of evidence is not evidence of absence. We can model 
this by modifying the observation model so that with probability $\rho$, we generate a 0 from the 
background model, and we only force the model to explain observed 0s with probability $1 - \rho$. 
In other words, we robustify the observation model to allow for outliers, as follows: 

$$p(R_{ij} = r | q_{i \to j} = a, q_{i \leftarrow j} = b, \eta) = \rho \, \delta_0(r) + (1 - \rho) \, \text{Ber}(r | \eta_{a,b}) \tag{27.79}$$

See (Airoldi et al. 2008) for details. 
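The robust observation model in Equation 27.79 is a two-component mixture, which the following small function evaluates directly. The function and argument names are illustrative.

```python
def robust_edge_pmf(r, eta_ab, rho):
    """p(R_ij = r) under the robust model: rho * delta_0(r) + (1 - rho) * Ber(r | eta_ab)."""
    bernoulli = eta_ab if r == 1 else 1.0 - eta_ab
    background = 1.0 if r == 0 else 0.0
    return rho * background + (1.0 - rho) * bernoulli
```

For instance, with $\eta_{a,b} = 0.8$ and $\rho = 0.25$, an observed 0 has probability $0.25 + 0.75 \times 0.2 = 0.4$ instead of $0.2$, so a missing edge between two groups that usually connect is penalized less heavily.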


Relational topic model 


In many cases, the nodes in our network have attributes. For example, if the nodes represent 
academic papers, and the edges represent citations, then the attributes include the text of the 
document itself. It is therefore desirable to create a model that can explain the text and the link 
structure concurrently. Such a model can predict links given text, or even vice versa. 

The relational topic model (RTM) (Chang and Blei 2010) is one way to do this. This is a 
simple extension of supervised LDA (Section 27.4.4.1), where the response variable $R_{ij}$ (which 
represents whether there is an edge between nodes $i$ and $j$) is modeled as follows: 

$$p(R_{ij} = 1 \mid \bar{q}_i, \bar{q}_j, \theta) = \mathrm{sigm}(w^T (\bar{q}_i \otimes \bar{q}_j) + w_0) \tag{27.80}$$

Recall that $\bar{q}_i$ is the empirical topic distribution for document $i$, 
$\bar{q}_{ik} \triangleq \frac{1}{L_i} \sum_{l=1}^{L_i} q_{ilk}$. See Figure 27.23. 

Figure 27.23 DGM for the relational topic model. 

Note that it is important that $R_{ij}$ depend on the actual topics chosen, $\bar{q}_i$ and $\bar{q}_j$, and not 
on the topic distributions, $\pi_i$ and $\pi_j$, otherwise predictive performance is not as good. The 
intuitive reason for this is as follows: if $R_{ij}$ is a child of $\pi_i$ and $\pi_j$, it will be treated as just 
another word, similar to the $q_{il}$’s and $y_{il}$’s; but since there are many more words than edges, 
the graph structure information will get “washed out”. By making $R_{ij}$ a child of $\bar{q}_i$ and $\bar{q}_j$, the 
graph information can influence the choice of topics more directly. 
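
To make the link model of Equation 27.80 concrete, here is a small sketch. It assumes that the product of the two empirical topic distributions is taken element-wise, which is how Chang and Blei (2010) define it; that reading of the notation, and the names `empirical_topic_dist` and `link_prob`, are assumptions of this example.

```python
import math


def empirical_topic_dist(assignments, K):
    """Fraction of the words in one document assigned to each of K topics."""
    counts = [0] * K
    for k in assignments:
        counts[k] += 1
    n_words = len(assignments)
    return [c / n_words for c in counts]


def link_prob(qbar_i, qbar_j, w, w0):
    """sigm(w^T (qbar_i * qbar_j) + w0), with an element-wise product."""
    score = w0 + sum(wk * a * b for wk, a, b in zip(w, qbar_i, qbar_j))
    return 1.0 / (1.0 + math.exp(-score))
```

With positive weights, two documents whose words concentrate on the same topic receive a higher link probability than two documents about different topics, because the element-wise product is only large where both distributions place mass.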

One can fit this model in a manner similar to SLDA. See (Chang and Blei 2010) for details. 
The method does better at predicting missing links than the simpler approach of first fitting an 
LDA model, and then using the $\bar{q}_i$’s as inputs to a logistic regression problem. The reason is 
analogous to the superiority of partial least squares (Section 12.5.2) to PCA + linear regression, 
namely that the RTM learns a latent space that is forced to be predictive of the graph structure 
and words, whereas LDA might learn a latent space that is not useful for predicting the graph. 


LVMs for relational data 


Graphs can be used to represent data that describes the relation amongst variables of a 
certain type, e.g., friendship relationships between people. But often we have multiple types of 
objects, and multiple types of relations. For example, Figure 27.24 illustrates two relations, one 
between people and people, and one between people and movies. 

Figure 27.24 Example of relational data. There are two types of objects, people and movies; one 2-ary 
relation, friends: people $\times$ people $\to \{0,1\}$, and one 2-ary function, rates: people $\times$ movie $\to \mathbb{R}$. Age and 
sex are attributes (unary functions) of the people class. 

In general, we define a k-ary relation $R$ as a subset of k-tuples of the appropriate types: 

$$R \subseteq T_1 \times T_2 \times \cdots \times T_k \tag{27.81}$$

where $T_i$ are sets or types. A binary, pairwise or dyadic relation is a relation defined on pairs 
of objects. For example, the seen relation between people and movies might be represented as 
the set of movies that people have seen. We can either represent this explicitly as a set, such as 


seen = { (Bob, StarWars), (Bob, TombRaider), (Alice, Jaws)} 
or implicitly, using an indicator function for the set: 
seen(Bob, StarWars)=1, seen(Bob, TombRaider)=1, seen(Alice, Jaws)=1 


A relation between two entities of types $T^1$ and $T^2$ can be represented as a binary function 
$R : T^1 \times T^2 \to \{0,1\}$, and hence as a binary matrix. This can also be represented as a bipartite 
graph, in which we have nodes of two types. If $T^1 = T^2$, this becomes a regular directed graph, 
as in Section 27.5. However, there are some situations that are not so easily modelled by graphs, 
but which can still be modelled by relations. For example, we might have a ternary relation, 
$R : T^1 \times T^1 \times T^2 \to \{0,1\}$, where, say, $R(i, j, k) = 1$ iff protein $i$ interacts with protein $j$ 
when chemical $k$ is present. This can be modelled by a 3d binary matrix. We will give some 
examples of this in Section 27.6.1. 
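
The three equivalent views of a binary relation (explicit set, indicator function, binary matrix) can be sketched as follows, reusing the seen example from the text. The orderings in `people` and `movies` are arbitrary choices made for this illustration.

```python
# Explicit representation: the relation is a set of (person, movie) pairs.
seen_set = {("Bob", "StarWars"), ("Bob", "TombRaider"), ("Alice", "Jaws")}


def seen(person, movie):
    """Implicit representation: indicator function of the set."""
    return 1 if (person, movie) in seen_set else 0


# Matrix representation: fix an ordering of each type, then tabulate.
people = ["Alice", "Bob"]
movies = ["Jaws", "StarWars", "TombRaider"]
seen_matrix = [[seen(p, m) for m in movies] for p in people]
```

A ternary relation such as the protein example would need one more index, for instance a list of such matrices with one matrix per chemical.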

Making probabilistic models of relational data is called statistical relational learning (Getoor 
and Taskar 2007). One approach is to directly model the relationship between the variables using 
graphical models; this is known as probabilistic relational modeling. Another approach is to 
use latent variable models, as we discuss below. 


Infinite relational model 


It is straightforward to extend the stochastic block model to model relational data: we just 
associate a latent variable $q_i^t \in \{1, \ldots, K_t\}$ with each entity $i$ of each type $t$. We then define 
the probability of the relation holding between specific entities by looking up the probability of 
the relation holding between entities of that type. For example, if $R : T^1 \times T^1 \times T^2 \to \{0,1\}$, 
we have 

$$p(R(i, j, k) = 1 \mid q_i^1 = a, q_j^1 = b, q_k^2 = c, \eta) = \eta_{a,b,c} \tag{27.82}$$
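
The lookup in Equation 27.82 amounts to indexing a table of probabilities by the cluster labels of the entities involved. A minimal sketch, with hypothetical names (`q1` and `q2` hold the cluster label of each entity of type 1 and type 2, and `eta` is a nested list indexed by three cluster labels):

```python
def relation_prob(i, j, k, q1, q2, eta):
    """P(R(i, j, k) = 1) for a relation over types (T1, T1, T2)."""
    a, b, c = q1[i], q1[j], q2[k]  # cluster labels of the three entities
    return eta[a][b][c]
```

All triples whose entities fall into the same three clusters share one parameter, which is what partitions the 3d array into blocks.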


If we allow the number of clusters $K_t$ for each type to be unbounded, by using a Dirichlet 
process, the model is called the infinite relational model (IRM) (Kemp et al. 2006). An essentially 
identical model, under the name infinite hidden relational model (IHRM), was concurrently 
proposed in (Xu et al. 2006). We can fit this model with variational Bayes (Xu et al. 2006, 2007) 
or collapsed Gibbs sampling (Kemp et al. 2006). Rather than go into algorithmic detail, we just 
sketch some interesting applications. 

Figure 27.25 Illustration of an ontology learned by IRM applied to the Unified Medical Language System. 
The boxes represent 7 of the 14 concept clusters. Predicates that belong to the same cluster are grouped 
together, and associated with edges to which they pertain. All links with weight above 0.8 have been 
included. From Figure 9 of (Kemp et al. 2010). Used with kind permission of Charles Kemp. 


Learning ontologies 


An ontology refers to an organisation of knowledge. In AI, ontologies are often built by hand 
(see e.g., (Russell and Norvig 2010)), but it is interesting to try and learn them from data. In 
(Kemp et al. 2006), they show how this can be done using the IRM. 

The data comes from the Unified Medical Language System (McCray 2003), which defines 
a semantic network with 135 concepts (such as “disease or syndrome”, “diagnostic procedure”, 
“animal”), and 49 binary predicates (such as “affects”, “prevents”). We can represent this as a 
ternary relation $R : T^1 \times T^1 \times T^2 \to \{0,1\}$, where $T^1$ is the set of concepts and $T^2$ is the 
set of binary predicates. The result is a 3d cube. We can then apply the IRM to partition the 
cube into regions of roughly homogeneous response. The system found 14 concept clusters and 
21 predicate clusters. Some of these are shown in Figure 27.25. The system learns, for example, 
that biological functions affect organisms (since $\eta_{a,b,c} \approx 1$ where $a$ represents the biological 
function cluster, $b$ represents the organism cluster, and $c$ represents the affects cluster). 


Clustering based on relations and features 


We can also use IRM to cluster objects based on their relations and their features. For example, 
(Kemp et al. 2006) consider a political dataset (from 1965) consisting of 14 countries, 54 binary 
predicates representing interaction types between countries (e.g., “sends tourists to”, “economic 
aid”), and 90 features (e.g., “communist”, “monarchy”). To create a binary dataset, real-valued 
features were thresholded at their mean, and categorical variables were dummy-encoded. The 
data has 3 types: $T^1$ represents countries, $T^2$ represents interactions, and $T^3$ represents features. 
We have two relations: $R^1 : T^1 \times T^1 \times T^2 \to \{0,1\}$, and $R^2 : T^1 \times T^3 \to \{0,1\}$. (This problem 
therefore combines aspects of both the biclustering model and the ontology discovery model.) 
When given multiple relations, the IRM treats them as conditionally independent. In this case, 
we have 

Figure 27.26 Illustration of IRM applied to some political data containing features and pairwise 
interactions. Top row (a): the partition of the countries into 5 clusters and the features into 5 clusters. Every 
second column is labelled with the name of the corresponding feature. Small squares at bottom (b-i): these 
are 8 of the 18 clusters of interaction types. From Figure 6 of (Kemp et al. 2006). Used with kind permission 
of Charles Kemp. 


$$p(R^1, R^2, q^1, q^2, q^3 \mid \theta) = p(R^1 \mid q^1, q^2, \theta) \, p(R^2 \mid q^1, q^3, \theta) \tag{27.83}$$

The results are shown in Figure 27.26. The IRM divides the 90 features into 5 clusters, the 
first of which contains “noncommunist”, which captures one of the most important aspects of 
this Cold-War era dataset. It also clusters the 14 countries into 5 clusters, reflecting natural 
geo-political groupings (e.g., US and UK, or the Communist Bloc), and the 54 predicates into 18 
clusters, reflecting similar relationships (e.g., “negative behavior” and “accusations”). 

Probabilistic matrix factorization for collaborative filtering 


As discussed in Section 1.3.4.2, collaborative filtering (CF) requires predicting entries in a matrix 
$R : T^1 \times T^2 \to \mathbb{R}$, where for example $R(i, j)$ is the rating that user $i$ gave to movie $j$. Thus 
we see that CF is a kind of relational learning problem (and one with particular commercial 
importance). 

Much of the work in this area makes use of the data that Netflix made available in their 
competition. In particular, a large 17,770 × 480,189 movie × user ratings matrix is provided. The 
full matrix would have $\approx 8.5 \times 10^9$ entries, but only 100,480,507 (about 1%) of the entries are 
observed, so the matrix is extremely sparse. In addition the data is quite imbalanced, with many 
users rating fewer than 5 movies, and a few users rating over 10,000 movies. The validation 
set is 1,408,395 (movie,user) pairs. Finally, there is a separate test set with 2,817,131 (movie,user) 
pairs, for which the rating is known but withheld from contestants. The performance measure 
is root mean square error: 

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( X(m_i, u_i) - \hat{X}(m_i, u_i) \right)^2} \tag{27.84}$$

where $X(m_i, u_i)$ is the true rating of user $u_i$ on movie $m_i$, and $\hat{X}(m_i, u_i)$ is the prediction. 
The baseline system, known as Cinematch, had an RMSE on the training set of 0.9514, and on 
the test set of 0.9525. To qualify for the grand prize, teams needed to reduce the test RMSE by 
10%, i.e., get a test RMSE of 0.8563 or less. We will discuss some of the basic methods used by 
the winning team below. 
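
For reference, the RMSE performance measure is straightforward to compute. The sketch below takes two equal-length lists of true and predicted ratings; the function name `rmse` is just a label chosen for this example.

```python
import math


def rmse(true_ratings, predicted_ratings):
    """Root mean square error between two equal-length rating lists."""
    if len(true_ratings) != len(predicted_ratings) or not true_ratings:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the errors are squared before averaging, a few badly mispredicted ratings raise the score more than many slightly mispredicted ones.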

Since the ratings are drawn from the set {0, 1, 2,3,4,5}, it is tempting to use a categorical 
observation model. However, this does not capture the fact that the ratings are ordered. Although 
we could use an ordinal observation model, in practice people use a Gaussian observation model 
for simplicity. One way to make the model better match the data is to pass the model's predicted 
mean response through a sigmoid, and then to map the [0,1] interval to [0,5] (Salakhutdinov 
and Mnih 2008). Alternatively we can make the data a better match to the Gaussian model by 
transforming the data using $R'_{ij} = \sqrt{6 - R_{ij}}$ (Aggarwal and Merugu 2007). 

We could use the IRM for the CF task, by associating a discrete latent variable for each user 
$q_i^u$ and for each movie or video $q_j^m$, and then defining 

$$p(R_{ij} = r \mid q_i^u = a, q_j^m = b, \theta) = \mathcal{N}(r \mid \mu_{a,b}, \sigma^2) \tag{27.85}$$


This is just another example of co-clustering. We can also extend the model to generate side 
information, such as attributes about each user and/or movie. See Figure 27.27 for an illustration. 

Another possibility is to replace the discrete latent variables with continuous latent variables 
$\pi_i^u \in S_{K_u}$ and $\pi_j^m \in S_{K_m}$. However, it has been found (see e.g., (Shan and Banerjee 2010)) that 
one obtains much better results by using unconstrained real-valued latent factors for each user 
$u_i \in \mathbb{R}^K$ and each movie $v_j \in \mathbb{R}^K$.⁷ We then use a likelihood of the form 

$$p(R_{ij} = r \mid u_i, v_j) = \mathcal{N}(r \mid u_i^T v_j, \sigma^2) \tag{27.86}$$


7. Good results with discrete latent variables have been obtained on some datasets that are smaller than Netflix, such as 
MovieLens and EachMovie. However, these datasets are much easier to predict, because there is less imbalance between 
the number of reviews performed by different users (in Netflix, some users have rated more than 10,000 movies, whereas 
others have rated less than 5). 
Figure 27.27 Visualization of a small relational dataset, where we have one relation, likes(user, movie), 
and features for movies (here, genre) and users (here, occupation). From Figure 5 of (Xu et al. 2008). Used 
with kind permission of Zhao Xu. 

Figure 27.28 (a) A DGM for probabilistic matrix factorization. (b) Visualization of the first two factors in 
the PMF model estimated from the Netflix challenge data. Each movie $j$ is plotted at the location specified 
by $v_j$. On the left we have low-brow humor and horror movies (Half Baked, Freddy vs Jason), and on the 
right we have more serious dramas (Sophie's Choice, Moonstruck). On the top we have critically acclaimed 
independent movies (Punch-Drunk Love, I Heart Huckabees), and on the bottom we have mainstream 
Hollywood blockbusters (Armageddon, Runaway Bride). The Wizard of Oz is right in the middle of these axes. 
From Figure 3 of (Koren et al. 2009). Used with kind permission of Yehuda Koren. 


This has been called probabilistic matrix factorization (PMF) (Salakhutdinov and Mnih 2008). 
See Figure 27.28(a) for the DGM. The intuition behind this method is that each user and each 
movie get embedded into the same low-dimensional continuous space (see Figure 27.28(b)). If a 
user is close to a movie in that space, they are likely to rate it highly. All of the best entries in 
the Netflix competition used this approach in one form or another.⁸ 

8. The winning entry was actually an ensemble of different methods, including PMF, nearest neighbor methods, etc. 

Figure 27.29 (a) RMSE on the validation set for different PMF variants vs number of passes through 
the data. “SVD” is the unregularized version, $\lambda_U = \lambda_V = 0$. “PMF1” corresponds to $\lambda_U = 0.01$ and 
$\lambda_V = 0.001$, while “PMF2” corresponds to $\lambda_U = 0.001$ and $\lambda_V = 0.0001$. “PMFA1” corresponds to a 
version where the mean and diagonal covariance of the Gaussian prior were learned from data. From 
Figure 2 of (Salakhutdinov and Mnih 2008). Used with kind permission of Ruslan Salakhutdinov. (b) RMSE 
on the test set (quiz portion) vs number of parameters for several different models. “Plain” is the baseline 
PMF with suitably chosen $\lambda_U$, $\lambda_V$. “With biases” adds $f_i$ and $g_j$ offset terms. “With implicit feedback” 
also exploits the knowledge of which movies each user chose to rate. 
“With temporal dynamics” allows the offset terms to change over time. The Netflix baseline system achieves 
an RMSE of 0.9514, and the grand prize’s required accuracy is 0.8563 (which was obtained on 21 September 
2009). Figure generated by netflixResultsPlot. From Figure 4 of (Koren et al. 2009). Used with kind 
permission of Yehuda Koren. 

PMF is closely related to the SVD. In particular, if there is no missing data, then computing 
the MLE for the $u_i$’s and the $v_j$’s is equivalent to finding a rank $K$ approximation to $R$. 
However, as soon as we have missing data, the problem becomes non-convex, as shown in 
(Srebro and Jaakkola 2003), and standard SVD methods cannot be applied. (Recall that in the 
Netflix challenge, only about 1% of the matrix is observed.) 
The most straightforward way to fit the PMF model is to minimize the overall NLL: 

$$J(U, V) = -\log p(R \mid U, V, O) = -\log \left( \prod_{i=1}^{N} \prod_{j=1}^{M} \left[ \mathcal{N}(R_{ij} \mid u_i^T v_j, \sigma^2) \right]^{\mathbb{I}(O_{ij} = 1)} \right) \tag{27.87}$$

where $O_{ij} = 1$ if user $i$ has seen movie $j$. Since this is non-convex, we can just find a locally 
optimal MLE. Since the Netflix data is so large (about 100 million observed entries), it is common 
to use stochastic gradient descent (Section 8.5.2) for this task. The gradient for $u_i$ (up to the 
constant factor $1/\sigma^2$) is given by 

$$\frac{dJ}{du_i} = \frac{\partial}{\partial u_i} \frac{1}{2} \sum_{i,j} \mathbb{I}(O_{ij} = 1)(R_{ij} - u_i^T v_j)^2 = -\sum_{j : O_{ij} = 1} e_{ij} v_j \tag{27.88}$$

where $e_{ij} = R_{ij} - u_i^T v_j$ is the error term. By stochastically sampling a single movie $j$ that user 
$i$ has watched, the update takes the following simple form: 

$$u_i = u_i + \eta e_{ij} v_j \tag{27.89}$$

where $\eta$ is the learning rate. The update for $v_j$ is similar. 
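
The stochastic update of Equation 27.89 can be sketched in plain Python on a toy problem. This is an illustration under stated assumptions rather than the procedure used by any Netflix entrant: ratings are given as hypothetical `(user, movie, rating)` triples, the factors are initialized with small Gaussian noise, each pass visits the observed ratings in random order, and the symmetric update is applied to the movie vector using the user vector from before the step.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pmf_sgd(ratings, n_users, n_movies, K=2, eta=0.02, epochs=500, seed=0):
    """Fit unregularized PMF by SGD; ratings is a list of (i, j, r) triples."""
    rng = random.Random(seed)
    U = [[rng.gauss(0.0, 0.1) for _ in range(K)] for _ in range(n_users)]
    V = [[rng.gauss(0.0, 0.1) for _ in range(K)] for _ in range(n_movies)]
    observed = list(ratings)
    for _ in range(epochs):
        rng.shuffle(observed)
        for i, j, r in observed:
            e = r - dot(U[i], V[j])          # error term e_ij
            u_old = U[i][:]                  # keep u_i from before the step
            U[i] = [u + eta * e * v for u, v in zip(U[i], V[j])]
            V[j] = [v + eta * e * u for v, u in zip(V[j], u_old)]
    return U, V


def squared_error(ratings, U, V):
    """Sum of squared errors over the observed entries only."""
    return sum((r - dot(U[i], V[j])) ** 2 for i, j, r in ratings)
```

Only observed entries are ever touched, which is what makes the method practical when about 1% of the matrix is filled in. On real data the step size and number of passes would need tuning, and some form of regularization would be needed to avoid overfitting.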
Of course, just maximizing the likelihood results in overfitting, as shown in Figure 27.29(a). 
We can regularize this by imposing Gaussian priors: 

$$p(U, V) = \prod_i \mathcal{N}(u_i \mid \mu_u, \Sigma_u) \prod_j \mathcal{N}(v_j \mid \mu_v, \Sigma_v) \tag{27.90}$$

If we use $\mu_u = \mu_v = 0$, $\Sigma_u = \sigma_U^2 I_K$, and $\Sigma_v = \sigma_V^2 I_K$, the new objective becomes 

$$J(U, V) = -\log p(R, U, V \mid O, \theta) \tag{27.91}$$

$$= \sum_i \sum_j \mathbb{I}(O_{ij} = 1)(R_{ij} - u_i^T v_j)^2 + \lambda_U \sum_i \|u_i\|_2^2 + \lambda_V \sum_j \|v_j\|_2^2 + \text{const} \tag{27.92}$$

where we have defined $\lambda_U = \sigma^2 / \sigma_U^2$, and $\lambda_V = \sigma^2 / \sigma_V^2$. By varying the regularizers, we can 
reduce the effect of overfitting, as shown in Figure 27.29(a). We can find MAP estimates using 
stochastic gradient descent. We can also compute approximate posteriors using variational Bayes 
(Ilin and Raiko 2010). 
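
As a worked step, the gradient of the regularized objective (Equation 27.92) with respect to one user vector follows by differentiating the squared-error term and the penalty term separately. The single-rating update in the second line is a sketch under two stated assumptions: the factor of 2 is absorbed into the learning rate $\eta$, and the full penalty is applied at every sampled rating, which is a common simplification and not something prescribed by the text.

```latex
\begin{align}
\frac{\partial J}{\partial u_i}
  &= -2 \sum_{j : O_{ij} = 1} e_{ij} v_j + 2 \lambda_U u_i,
  \qquad e_{ij} = R_{ij} - u_i^T v_j \\
u_i &\leftarrow u_i + \eta \left( e_{ij} v_j - \lambda_U u_i \right)
\end{align}
```

Setting $\lambda_U = 0$ recovers the unregularized update of Equation 27.89; the update for $v_j$ has the same form with $\lambda_V$ in place of $\lambda_U$.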

If we use diagonal covariances for the priors, we can penalize each latent dimension by a 
different amount. Also, if we use non-zero means for the priors, we can account for offset terms. 
Optimizing the prior parameters $(\mu_u, \Sigma_u, \mu_v, \Sigma_v)$ at the same time as the model parameters 
$(U, V, \sigma^2)$ is a way to create an adaptive prior. This avoids the need to search for the optimal 
values of $\lambda_U$ and $\lambda_V$, and gives even better results, as shown in Figure 27.29(a). 

It turns out that much of the variation in the data can be explained by movie-specific or 
user-specific effects. For example, some movies are popular for all types of users. And some 
users give low scores for all types of movies. We can model this by allowing for user and movie 
specific offset or bias terms as follows: 

$$p(R_{ij} = r \mid u_i, v_j, \theta) = \mathcal{N}(r \mid u_i^T v_j + \mu + f_i + g_j, \sigma^2) \tag{27.93}$$

where $\mu$ is the overall mean, $f_i$ is the user bias, $g_j$ is the movie bias, and $u_i^T v_j$ is the 
interaction term. This is equivalent to applying PMF just to the residual matrix, and gives much 
better results, as shown in Figure 27.29(b). We can estimate the $f_i$, $g_j$ and $\mu$ terms using 
stochastic gradient descent, just as we estimated $U$, $V$ and $\theta$. 
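
A single stochastic gradient step for the model with offsets (Equation 27.93) might look as follows. This is a sketch only: the function names `predict` and `bias_sgd_step` are hypothetical, one shared learning rate is used for all terms, and no regularization is applied.

```python
def predict(mu, f_i, g_j, u_i, v_j):
    """Predicted mean rating: interaction term plus the three offsets."""
    return sum(a * b for a, b in zip(u_i, v_j)) + mu + f_i + g_j


def bias_sgd_step(r, mu, f_i, g_j, u_i, v_j, eta):
    """One SGD step on the squared error for a single observed rating r."""
    e = r - predict(mu, f_i, g_j, u_i, v_j)
    new_u = [a + eta * e * b for a, b in zip(u_i, v_j)]
    new_v = [b + eta * e * a for a, b in zip(u_i, v_j)]
    # Each offset enters the prediction linearly, so its gradient is just -e.
    return mu + eta * e, f_i + eta * e, g_j + eta * e, new_u, new_v
```

Since the offsets absorb effects such as a generally popular movie or a generally harsh user, the interaction term is left to explain only the residual, as the text notes.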

We can also allow the bias terms to evolve over time, to reflect the changing preferences of 
users (Koren 2009b). This is important since in the Netflix competition, the test data was more 
recent than the training data. Figure 27.29(b) shows that allowing for temporal dynamics can 
help a lot. 

Often we also have side information of various kinds. In the Netflix competition, entrants 
knew which movies the user had rated in the test set, even though they did not know the 
values of these ratings. That is, they knew the value of the (dense) $O$ matrix even on the 
test set. If a user chooses to rate a movie, it is likely because they have seen it, which in 
turn means they thought they would like it. Thus the very act of rating reveals information. 
Conversely, if a user chooses not to rate a movie, it suggests they knew they would not like it. 
So the data is not missing at random (see e.g., (Marlin and Zemel 2009)). Exploiting this can 
improve performance, as shown in Figure 27.29(b). In real problems, information on the test set 
is not available. However, we often know which movies the user has watched or declined to 
watch, even if they did not rate them (this is called implicit feedback), and this can be used as 
useful side information. 

Another source of side information concerns the content of the movie, such as the movie 
genre, the list of the actors, or a synopsis of the plot. This can be denoted by $x_v$, the features 
of the video. (In the case where we just have the id of the video, we can treat $x_v$ as a 
$|V|$-dimensional bit vector with just one bit turned on.) We may also know features about the user, 
which we can denote by $x_u$. In some cases, we only know if the user clicked on the video or 
not, that is, we may not have a numerical rating. We can then modify the model as follows: 

$$p(R(u, v) \mid x_u, x_v, \theta) = \mathrm{Ber}(R(u, v) \mid (U x_u)^T (V x_v)) \tag{27.94}$$

where $U$ is a $|U| \times K$ matrix, and $V$ is a $|V| \times K$ matrix (we can incorporate an offset term 
by appending a 1 to $x_u$ and $x_v$ in the usual way). A method for computing the approximate 
posterior $p(U, V \mid D)$ in an online fashion, using ADF and EP, was described in (Stern et al. 
2009). This was implemented by Microsoft and has been deployed to predict click through rates 
on all the ads used by Bing. 

Unfortunately, fitting this model just from positive binary data can result in an over-prediction 
of links, since no negative examples are included. Better performance is obtained if one has 
access to the set of all videos shown to the user, of which at most one was picked; data of this 
form is known as an impression log. In this case, we can use a multinomial model instead of 
a binary model; in (Yang et al. 2011), this was shown to work much better than a binary model. 
To understand why, suppose someone is presented with a choice of an action movie starring Arnold 
Schwarzenegger, an action movie starring Vin Diesel, and a comedy starring Hugh Grant. If 
the user picks Arnold Schwarzenegger, we learn not only that they prefer action movies to 
comedies, but also that they prefer Schwarzenegger to Diesel. This is more informative than just 
knowing that they like Schwarzenegger and action movies. 
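
One simple way to realize a multinomial model over an impression is a softmax over the scores of the items that were shown. The text does not specify the parameterization used by Yang et al., so the softmax form and the names below are assumptions of this sketch; the scores could, for example, be the inner products of the user and video factors.

```python
import math


def choice_probs(scores):
    """Softmax over the items shown together in one impression."""
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def impression_log_lik(scores, picked):
    """Log-probability that the item at index `picked` was chosen."""
    return math.log(choice_probs(scores)[picked])
```

Because the probabilities of the shown items must sum to one, raising the score of the picked item necessarily lowers the probability of the alternatives, which is how the unpicked items act as implicit negatives.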


Restricted Boltzmann machines (RBMs) 


So far, all the models we have proposed in this chapter have been representable by directed 
graphical models. But some models are better represented using undirected graphs. For example, 
the Boltzmann machine (Ackley et al. 1985) is a pairwise MRF with hidden nodes h and visible 
nodes v, as shown in Figure 27.30(a). The main problem with the Boltzmann machine is that 
exact inference is intractable, and even approximate inference, using e.g., Gibbs sampling, can 
be slow. However, suppose we restrict the architecture so that the nodes are arranged in layers, 
and so that there are no connections between nodes within the same layer (see Figure 27.30(b)). 
Then the model has the form 


$$p(h, v \mid \theta) = \frac{1}{Z(\theta)} \prod_{r=1}^{R} \prod_{k=1}^{K} \psi_{rk}(v_r, h_k) \tag{27.95}$$


where R is the number of visible (response) variables, K is the number of hidden variables, and 
v plays the role of y earlier in this chapter. This model is known as a restricted Boltzmann 
machine (RBM) (Hinton 2002), or a harmonium (Smolensky 1986). 

Figure 27.30 (a) A general Boltzmann machine, with an arbitrary graph structure. The shaded (visible) 
nodes are partitioned into input and output, although the model is actually symmetric and defines a joint 
density on all the nodes. (b) A restricted Boltzmann machine with a bipartite structure. Note the lack of 
intra-layer connections. 

An RBM is a special case of a product of experts (PoE) (Hinton 1999), which is so-called 
because we are multiplying together a set of “experts” (here, potential functions on each edge) 
and then normalizing, whereas in a mixture of experts, we take a convex combination of 
normalized distributions. The intuitive reason why PoE models might work better than a mixture 
is that each expert can enforce a constraint (if the expert has a value which is > 1 or < 1) 
or a “don’t care” condition (if the expert has value 1). By multiplying these experts together 
in different ways we can create “sharp” distributions which predict data which satisfies the 
specified constraints (Hinton and Teh 2001). For example, consider a distributed model of text. 
A given document might have the topics “government”, “mafia” and “playboy”. If we “multiply” 
the predictions of each topic together, the model may give very high probability to the word 
“Berlusconi”⁹ (Salakhutdinov and Hinton 2010). By contrast, adding together experts can only 
make the distribution broader (see Figure 14.17). 
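
The contrast between multiplying and averaging experts can be checked numerically on a small discrete example. The two expert distributions used in the usage note below are made-up numbers for illustration, and the function names are hypothetical.

```python
import math


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def product_of_experts(experts):
    """Multiply the experts state by state, then renormalize."""
    prod = [1.0] * len(experts[0])
    for p in experts:
        prod = [a * b for a, b in zip(prod, p)]
    return normalize(prod)


def mixture_of_experts(experts):
    """Uniform convex combination of the (already normalized) experts."""
    n = len(experts)
    return [sum(p[s] for p in experts) / n for s in range(len(experts[0]))]


def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)
```

For instance, with `p1 = [0.4, 0.4, 0.1, 0.05, 0.05]` and `p2 = [0.05, 0.4, 0.4, 0.1, 0.05]`, the product concentrates most of its mass on the one state both experts agree on, while the mixture spreads mass over every state that either expert supports, so the product has the lower entropy.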

Typically the hidden nodes in an RBM are binary, so $h$ specifies which constraints are active. 
It is worth comparing this with the directed models we have discussed. In a mixture model, we 
have one hidden variable $q \in \{1, \ldots, K\}$. We can represent this using a set of $K$ bits, with the 
restriction that exactly one bit is on at a time. This is called a localist encoding, since only 
one hidden unit is used to generate the response vector. This is analogous to the hypothetical 
notion of grandmother cells in the brain, that are able to recognize only one kind of object. 
By contrast, an RBM uses a distributed encoding, where many units are involved in generating 
each output. Models that use vector-valued hidden variables, such as $\pi \in S_K$, as in mPCA/LDA, 
or $z \in \mathbb{R}^K$, as in ePCA, also use distributed encodings. 

The main difference between an RBM and directed two-layer models is that the hidden 
variables are conditionally independent given the visible variables, so the posterior factorizes: 

$$p(h \mid v, \theta) = \prod_k p(h_k \mid v, \theta) \tag{27.96}$$

This makes inference much simpler than in a directed model, since we can estimate each $h_k$ 
independently and in parallel, as in a feedforward neural network. The disadvantage is that 
training undirected models is much harder, as we discuss below. 

9. Silvio Berlusconi is the current (2011) prime minister of Italy. 

Visible               Hidden    Name                                 Reference
Binary                Binary    Binary RBM                           (Hinton 2002)
Gaussian              Binary    Gaussian RBM                         (Welling and Sutton 2005)
Categorical           Binary    Categorical RBM                      (Salakhutdinov et al. 2007)
Multiple categorical  Binary    Replicated softmax / undirected LDA  (Salakhutdinov and Hinton 2010)
Gaussian              Gaussian  Undirected PCA                       (Marks and Movellan 2001)
Binary                Gaussian  Undirected binary PCA                (Welling and Sutton 2005)

Table 27.2 Summary of different kinds of RBM. 


Varieties of RBMs 


In this section, we describe various forms of RBMs, by defining different pairwise potential 
functions. See Table 27.2 for a summary. All of these are special cases of the exponential 
family harmonium (Welling et al. 2004). 


Binary RBMs 


The most common form of RBM has binary hidden nodes and binary visible nodes. The joint 
distribution then has the following form: 

$$p(v, h \mid \theta) = \frac{1}{Z(\theta)} \exp(-E(v, h; \theta)) \tag{27.97}$$

$$E(v, h; \theta) \triangleq -\sum_{r=1}^{R} \sum_{k=1}^{K} v_r h_k W_{rk} - \sum_{r=1}^{R} v_r b_r - \sum_{k=1}^{K} h_k c_k \tag{27.98}$$

$$= -(v^T W h + v^T b + h^T c) \tag{27.99}$$

$$Z(\theta) = \sum_v \sum_h \exp(-E(v, h; \theta)) \tag{27.100}$$

where $E$ is the energy function, $W$ is an $R \times K$ weight matrix, $b$ are the visible bias terms, $c$ are 
the hidden bias terms, and $\theta = (W, b, c)$ are all the parameters. For notational simplicity, we 
will absorb the bias terms into the weight matrix by clamping dummy units $v_0 = 1$ and $h_0 = 1$ 
and setting $w_{0,:} = c$ and $w_{:,0} = b$. Note that naively computing $Z(\theta)$ takes $O(2^R 2^K)$ time 
but we can reduce this to $O(\min\{R 2^K, K 2^R\})$ time (Exercise 27.1). 

When using a binary RBM, the posterior can be computed as follows: 

$$p(h \mid v, \theta) = \prod_{k=1}^{K} p(h_k \mid v, \theta) = \prod_k \mathrm{Ber}(h_k \mid \mathrm{sigm}(w_{:,k}^T v)) \tag{27.101}$$

By symmetry, one can show that we can generate data given the hidden variables as follows: 

$$p(v \mid h, \theta) = \prod_{r=1}^{R} p(v_r \mid h, \theta) = \prod_r \mathrm{Ber}(v_r \mid \mathrm{sigm}(w_{r,:}^T h)) \tag{27.102}$$

We can write this in matrix-vector notation as follows: 

$$E[h \mid v, \theta] = \mathrm{sigm}(W^T v) \tag{27.103}$$

$$E[v \mid h, \theta] = \mathrm{sigm}(W h) \tag{27.104}$$


The weights in $W$ are called the generative weights, since they are used to generate the 
observations, and the weights in $W^T$ are called the recognition weights, since they are used 
to recognize the input. 

From Equation 27.101, we see that we activate hidden node $k$ in proportion to how much the 
input vector $v$ “looks like” the weight vector $w_{:,k}$ (up to scaling factors). Thus each hidden node 
captures certain features of the input, as encoded in its weight vector, similar to a feedforward 
neural network. 
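
The two conditionals of a binary RBM (Equations 27.101 to 27.104) are simple to compute. The sketch below keeps the biases explicit instead of absorbing them into the weight matrix; it stores `W` as a list of R rows of length K, and the function names are hypothetical.

```python
import math


def sigm(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def hidden_probs(v, W, c):
    """p(h_k = 1 | v) for every k, i.e. sigm(W^T v + c)."""
    return [sigm(c[k] + sum(v[r] * W[r][k] for r in range(len(v))))
            for k in range(len(c))]


def visible_probs(h, W, b):
    """p(v_r = 1 | h) for every r, i.e. sigm(W h + b)."""
    return [sigm(b[r] + sum(W[r][k] * h[k] for k in range(len(h))))
            for r in range(len(b))]


def sample_bernoulli(probs, rng):
    """Draw one independent binary value per probability."""
    return [1 if rng.random() < p else 0 for p in probs]
```

Each hidden unit needs only one inner product with the visible vector, so all K units can be evaluated independently, which is the factorization property the text emphasizes.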


Categorical RBM 


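We can extend the binary RBM to categorical visible variables by using a 1-of-$C$ encoding, 
where $C$ is the number of states for each $v_{ir}$. We define a new energy function as follows 
(Salakhutdinov et al. 2007; Salakhutdinov and Hinton 2010): 

$$E(v, h; \theta) \triangleq -\sum_{r=1}^{R} \sum_{k=1}^{K} \sum_{c=1}^{C} v_r^c h_k W_{rk}^c - \sum_{r=1}^{R} \sum_{c=1}^{C} v_r^c b_r^c - \sum_{k=1}^{K} h_k c_k \tag{27.105}$$

The full conditionals are given by 

$$p(v_r \mid h, \theta) = \mathrm{Cat}\left( S\left( \left\{ b_r^c + \sum_k h_k W_{rk}^c \right\}_{c=1}^{C} \right) \right) \tag{27.106}$$

$$p(h_k = 1 \mid v, \theta) = \mathrm{sigm}\left( c_k + \sum_r \sum_c v_r^c W_{rk}^c \right) \tag{27.107}$$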

Gaussian RBM 


We can generalize the model to handle real-valued data. In particular, a Gaussian RBM has the 
following energy function: 

$$E(v, h \mid \theta) = -\sum_{r=1}^{R} \sum_{k=1}^{K} W_{rk} h_k v_r + \frac{1}{2} \sum_{r=1}^{R} (v_r - b_r)^2 - \sum_{k=1}^{K} a_k h_k \tag{27.108}$$

The parameters of the model are $\theta = (w_{rk}, a_k, b_r)$. (We have assumed the data is standardized, 
so we fix the variance to $\sigma^2 = 1$.) Compare this to a Gaussian in information form: 

$$\mathcal{N}_c(v \mid \eta, \Lambda) \propto \exp\left( \eta^T v - \frac{1}{2} v^T \Lambda v \right) \tag{27.109}$$

where $\eta = \Lambda \mu$. We see that we have set $\Lambda = I$, and $\eta = \sum_k h_k w_{:,k}$. Thus the mean is 
given by $\mu = \Lambda^{-1} \eta = \sum_k h_k w_{:,k}$. The full conditionals, which are needed for inference and 
learning, are given by 

$$p(v_r \mid h, \theta) = \mathcal{N}\left( v_r \mid b_r + \sum_k w_{rk} h_k, 1 \right) \tag{27.110}$$

$$p(h_k = 1 \mid v, \theta) = \mathrm{sigm}\left( a_k + \sum_r w_{rk} v_r \right) \tag{27.111}$$

We see that each visible unit has a Gaussian distribution whose mean is a function of the 
hidden bit vector. More powerful models, which make the (co)variance depend on the hidden 
state, can also be developed (Ranzato and Hinton 2010). 


RBMs with Gaussian hidden units 


If we use Gaussian latent variables and Gaussian visible variables, we get an undirected version 
of factor analysis. However, it turns out that it is identical to the standard directed version 
(Marks and Movellan 2001). 

If we use Gaussian latent variables and categorical observed variables, we get an undirected 
version of categorical PCA (Section 27.2.2). In (Salakhutdinov et al. 2007), this was applied to the 
Netflix collaborative filtering problem, but was found to be significantly inferior to using binary 
latent variables, which have more expressive power. 


Learning RBMs 


In this section, we discuss some ways to compute ML parameter estimates of RBMs, using 
gradient-based optimizers. It is common to use stochastic gradient descent, since RBMs often 
have many parameters and therefore need to be trained on very large datasets. In addition, it is 
standard to use $\ell_2$ regularization, a technique that is often called weight decay in this context. 
This requires a very small change to the objective and gradient, as discussed in Section 8.3.6. 


Deriving the gradient using $p(h, v \mid \theta)$ 


To compute the gradient, we can modify the equations from Section 19.5.2, which show how to 
fit a generic latent variable maxent model. In the context of the Boltzmann machine, we have 
one feature per edge, so the gradient is given by 

$$\frac{\partial \ell}{\partial w_{rk}} = \frac{1}{N} \sum_{i=1}^{N} E[v_r h_k \mid v_i, \theta] - E[v_r h_k \mid \theta] \tag{27.112}$$

We can write this in matrix-vector form as follows: 

$$\nabla_W \ell = E_{p_{\mathrm{emp}}(\cdot \mid \theta)}[v h^T] - E_{p(\cdot \mid \theta)}[v h^T] \tag{27.113}$$

where $p_{\mathrm{emp}}(v, h \mid \theta) \triangleq p(h \mid v, \theta) \, p_{\mathrm{emp}}(v)$, and $p_{\mathrm{emp}}(v) = \frac{1}{N} \sum_{i=1}^{N} \delta_{v_i}(v)$ is the empirical 
distribution. (We can derive a similar expression for the bias terms by setting $v_r = 1$ or 
$h_k = 1$.) 


The first term in the gradient, when $v$ is fixed to a data case, is sometimes called the 
clamped phase, and the second term, when $v$ is free, is sometimes called the unclamped 
phase. When the model expectations match the empirical expectations, the two terms cancel 
out, the gradient becomes zero and learning stops. This algorithm was first proposed in (Ackley 
et al. 1985). The main problem is efficiently computing the expectations. We discuss some ways 
to do this below. 


27.7.2.2 Deriving the gradient using $p(v \mid \theta)$ 


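We now present an alternative way to derive Equation 27.112, which also applies to other 
energy-based models. First we marginalize out the hidden variables and write the RBM in the form 
$p(v \mid \theta) = \frac{1}{Z(\theta)} \exp(-F(v; \theta))$, where $F(v; \theta)$ is the free energy. Omitting the bias 
terms, so that $E(v, h) = -\sum_{r=1}^{R} \sum_{k=1}^{K} v_r h_k W_{rk}$, we have 

\begin{align}
F(v) &\triangleq -\log \sum_{h} \exp(-E(v, h)) = -\log \sum_{h} \exp\Big( \sum_{r=1}^{R} \sum_{k=1}^{K} v_r h_k W_{rk} \Big) \tag{27.114} \\
&= -\log \sum_{h} \prod_{k=1}^{K} \exp\Big( \sum_{r=1}^{R} v_r h_k W_{rk} \Big) \tag{27.115} \\
&= -\log \prod_{k=1}^{K} \sum_{h_k \in \{0,1\}} \exp\Big( \sum_{r=1}^{R} v_r h_k W_{rk} \Big) \tag{27.116} \\
&= -\sum_{k=1}^{K} \log\Big( 1 + \exp\Big( \sum_{r=1}^{R} v_r W_{rk} \Big) \Big) \tag{27.117}
\end{align}

Next we write the (scaled) log-likelihood in the following form: 

$$\ell(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log p(v_i \mid \theta) = -\frac{1}{N} \sum_{i=1}^{N} F(v_i; \theta) - \log Z(\theta) \tag{27.118}$$

Using the fact that $Z(\theta) = \sum_{v} \exp(-F(v; \theta))$ we have 

\begin{align}
\nabla \ell(\theta) &= -\frac{1}{N} \sum_{i=1}^{N} \nabla F(v_i) - \frac{\nabla Z(\theta)}{Z(\theta)} \tag{27.119} \\
&= -\frac{1}{N} \sum_{i=1}^{N} \nabla F(v_i) + \sum_{v} \nabla F(v) \frac{\exp(-F(v))}{Z(\theta)} \tag{27.120} \\
&= -\frac{1}{N} \sum_{i=1}^{N} \nabla F(v_i) + E[\nabla F(v)] \tag{27.121}
\end{align}

Plugging in the free energy (Equation 27.117), one can show that 

$$\frac{\partial}{\partial w_{rk}} F(v) = -v_r E[h_k \mid v, \theta] = -E[v_r h_k \mid v, \theta] \tag{27.122}$$

Hence 

$$\frac{\partial}{\partial w_{rk}} \ell(\theta) = \frac{1}{N} \sum_{i=1}^{N} E[v_r h_k \mid v_i, \theta] - E[v_r h_k \mid \theta] \tag{27.123}$$

which matches Equation 27.112. 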
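The free-energy derivation can be checked numerically. The following is a minimal sketch, assuming a bias-free binary RBM whose free energy is F(v) = -sum_k log(1 + exp(sum_r v_r W_rk)); the function names and the sizes R = 4, K = 3 are illustrative choices, not part of the original text. It compares the closed form against a brute-force sum over all hidden configurations, and compares a finite-difference gradient against -v_r E[h_k | v].

```python
import itertools
import numpy as np

def sigm(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def free_energy(v, W):
    """Closed form: F(v) = -sum_k log(1 + exp(sum_r v_r W_rk)), biases omitted."""
    return -np.sum(np.logaddexp(0.0, v @ W))

def free_energy_grad(v, W):
    """Analytic gradient: dF/dW_rk = -v_r * E[h_k | v] = -v_r * sigm(v^T W[:, k])."""
    return -np.outer(v, sigm(v @ W))

rng = np.random.default_rng(0)
R, K = 4, 3                                   # hypothetical layer sizes
W = rng.normal(scale=0.5, size=(R, K))
v = rng.integers(0, 2, size=R).astype(float)

# Brute force: F(v) = -log sum_h exp(v^T W h), enumerating all 2^K hidden vectors.
brute = -np.log(sum(np.exp(v @ W @ np.array(h))
                    for h in itertools.product([0, 1], repeat=K)))

# Central finite differences, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for r in range(R):
    for k in range(K):
        dW = np.zeros_like(W)
        dW[r, k] = eps
        numeric[r, k] = (free_energy(v, W + dW) - free_energy(v, W - dW)) / (2 * eps)

max_err = np.max(np.abs(numeric - free_energy_grad(v, W)))
```

Both checks should agree to within numerical precision; the brute-force sum is only feasible because K is tiny here.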
Figure 27.31 Illustration of Gibbs sampling in an RBM. The visible nodes are initialized at a data vector, 
then we sample a hidden vector, then another visible vector, etc. Eventually (at “infinity”) we will be 
producing samples from the joint distribution $p(v, h \mid \theta)$. (Panel labels: T = 1, 1-step reconstructions; 
T = infinity, equilibrium samples.) 


Approximating the expectations 


We can approximate the expectations needed to evaluate the gradient by performing block 
Gibbs sampling, using Equations 27.101 and 27.102. In more detail, we can sample from the 
joint distribution $p(v, h \mid \theta)$ as follows: initialize the chain at $v^1$ (e.g., by setting $v^1 = v_i$ for 
some data vector), and then sample $h^1 \sim p(h \mid v^1)$, then $v^2 \sim p(v \mid h^1)$, then 
$h^2 \sim p(h \mid v^2)$, etc. See Figure 27.31 for an illustration. Note, however, that we have to wait until 
the Markov chain reaches equilibrium (i.e., until it has “burned in”) before we can interpret the 
samples as coming from the joint distribution of interest, and this might take a long time. 
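The alternating scheme above can be sketched in a few lines of NumPy. This is an illustrative sketch, assuming a binary RBM with weight matrix W of shape (R, K), visible biases b, and hidden biases c; the function names and the toy sizes are hypothetical.

```python
import numpy as np

def sigm(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, c, rng):
    """Sample all hidden units in one block: h_k ~ Ber(sigm(v^T W[:, k] + c_k))."""
    p = sigm(v @ W + c)
    return (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h, W, b, rng):
    """Sample all visible units in one block: v_r ~ Ber(sigm(W[r, :] h + b_r))."""
    p = sigm(W @ h + b)
    return (rng.random(p.shape) < p).astype(float)

def block_gibbs(v_init, W, b, c, n_sweeps, rng):
    """Start at v_init, then alternate h | v and v | h for n_sweeps up-down sweeps."""
    v = v_init.copy()
    h = sample_h_given_v(v, W, c, rng)
    for _ in range(n_sweeps):
        v = sample_v_given_h(h, W, b, rng)
        h = sample_h_given_v(v, W, c, rng)
    return v, h

rng = np.random.default_rng(0)
R, K = 6, 3                                   # hypothetical layer sizes
W = rng.normal(scale=0.1, size=(R, K))
b, c = np.zeros(R), np.zeros(K)
v1 = rng.integers(0, 2, size=R).astype(float)  # stands in for a data vector v_i
v, h = block_gibbs(v1, W, b, c, n_sweeps=100, rng=rng)
```

The number of sweeps needed for burn-in is problem dependent; the value 100 here is arbitrary.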

A faster alternative is to use mean field, where we make the approximation 
$E[v_r h_k] \approx E[v_r] E[h_k]$. However, since p(v, h) is typically multimodal, this is usually a very poor 
approximation, since it will average over different modes (see Section 21.2.2). Furthermore, there is a 
more subtle reason not to use mean field: since the gradient has the form $E[v_r h_k \mid v] - E[v_r h_k]$, 
the negative sign in front of the second term means that the method will try to make the variational 
bound as loose as possible (Salakhutdinov and Hinton 2009). This explains why earlier attempts 
to use mean field to learn Boltzmann machines (e.g., (Kappen and Rodriguez 1998)) did not work. 


Contrastive divergence 


The problem with using Gibbs sampling to compute the gradient is that it is slow. We now 
present a faster method known as contrastive divergence or CD (Hinton 2002). CD was 
originally derived by approximating an objective function defined as the difference of two KL 
divergences, rather than trying to maximize the likelihood itself. However, from an algorithmic 
point of view, it can be thought of as similar to stochastic gradient descent, except it 
approximates the “unclamped” expectations with “brief” Gibbs sampling where we initialize each Markov 
chain at the data vectors. That is, we approximate the gradient for one data vector as follows: 


$$\nabla_W \ell \approx E[v h^T \mid v_i] - E_q[v h^T] \tag{27.124}$$

where $q$ corresponds to the distribution generated by $K$ up-down Gibbs sweeps, started at $v_i$, 
as in Figure 27.31. This is known as CD-K. In more detail, the procedure (for K = 1) is as 
follows: 

\begin{align}
h_i &\sim p(h \mid v_i, \theta) \tag{27.125} \\
v'_i &\sim p(v \mid h_i, \theta) \tag{27.126} \\
h'_i &\sim p(h \mid v'_i, \theta) \tag{27.127}
\end{align}

We then make the approximation 

$$E_q[v h^T] \approx v'_i (h'_i)^T \tag{27.128}$$


Such samples are sometimes called fantasy data. We can think of $v'_i$ as the model’s best 
attempt at reconstructing $v_i$ after being coded and then decoded by the model. This is similar 
to the way we train auto-encoders, which are models which try to “squeeze” the data through a 
restricted parametric “bottleneck” (see Section 28.3.2). 

In practice, it is common to use $E[h \mid v'_i]$ instead of a sampled value $h'_i$ in the final upwards 
pass, since this reduces the variance. However, it is not valid to use $E[h \mid v_i]$ instead of sampling 
$h_i \sim p(h \mid v_i)$ in the earlier upwards passes, because then each hidden unit would be able to 
pass more than 1 bit of information, so it would not act as much of a bottleneck. 

The whole procedure is summarized in Algorithm 27.3. (Note that we follow the positive gradient 
since we are maximizing likelihood.) Various tricks can be used to speed this algorithm up, such 
as using a momentum term (Section 8.3.2), using mini-batches, averaging the updates, etc. Such 
details can be found in (Hinton 2010; Swersky et al. 2010). 


Algorithm 27.3: CD-1 training for an RBM with binary hidden and visible units 

Initialize weights $W \in \mathbb{R}^{R \times K}$ randomly; 
$t := 0$; 
for each epoch do 
    $t := t + 1$; 
    for each minibatch of size $B$ do 
        Set minibatch gradient to zero, $g := 0$; 
        for each case $v_i$ in the minibatch do 
            Compute $\mu_i = E[h \mid v_i, W]$; 
            Sample $h_i \sim p(h \mid v_i, W)$; 
            Sample $v'_i \sim p(v \mid h_i, W)$; 
            Compute $\mu'_i = E[h \mid v'_i, W]$; 
            Compute gradient $\nabla_W = (v_i)(\mu_i)^T - (v'_i)(\mu'_i)^T$; 
            Accumulate $g := g + \nabla_W$; 
        Update parameters $W := W + (\alpha_t / B) g$ 
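A vectorized NumPy rendering of Algorithm 27.3 might look as follows. This is a sketch under stated assumptions rather than a reference implementation: bias terms are omitted, a constant step size `lr` replaces the schedule $\alpha_t$, and the function name `cd1_train`, the hyperparameter defaults, and the toy data are all hypothetical.

```python
import numpy as np

def sigm(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def cd1_train(V, K, n_epochs=20, batch_size=10, lr=0.1, seed=0):
    """CD-1 for a bias-free binary RBM. V is an (N, R) binary array; returns W of shape (R, K)."""
    rng = np.random.default_rng(seed)
    N, R = V.shape
    W = 0.01 * rng.standard_normal((R, K))
    for _ in range(n_epochs):
        perm = rng.permutation(N)
        for start in range(0, N, batch_size):
            Vb = V[perm[start:start + batch_size]]                 # minibatch, shape (B, R)
            mu = sigm(Vb @ W)                                      # mu_i = E[h | v_i]
            H = (rng.random(mu.shape) < mu).astype(float)          # h_i ~ p(h | v_i)
            Vp = (rng.random(Vb.shape) < sigm(H @ W.T)).astype(float)  # v'_i ~ p(v | h_i)
            mu_p = sigm(Vp @ W)                                    # mu'_i = E[h | v'_i]
            g = Vb.T @ mu - Vp.T @ mu_p                            # summed over the minibatch
            W += (lr / len(Vb)) * g                                # ascend the approximate gradient
    return W

# Toy data: noisy-free copies of two binary prototype patterns.
rng = np.random.default_rng(1)
protos = np.array([[1, 1, 1, 0, 0, 0],
                   [0, 0, 0, 1, 1, 1]], dtype=float)
V = protos[rng.integers(0, 2, size=200)]
W = cd1_train(V, K=2)
```

Processing the whole minibatch with matrix products is equivalent to the per-case loop in the algorithm box, since the accumulated gradient is a sum of outer products.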


Persistent CD 


In Section 19.5.5, we presented a technique called stochastic maximum likelihood (SML) for 
fitting maxent models. This avoids the need to run MCMC to convergence at each iteration, 
by exploiting the fact that the parameters are changing slowly, so the Markov chains will not 
be pushed too far from equilibrium after each update (Younes 1989). In other words, there are 
two dynamical processes running at different time scales: the states change quickly, and the 
parameters change slowly. This algorithm was independently rediscovered in (Tieleman 2008), 
who called it persistent CD. See Algorithm 27.4 for the pseudocode. 

PCD often works better than CD (see e.g., (Tieleman 2008; Marlin et al. 2010; Swersky et al. 
2010)), although CD can be faster in the early stages of learning. 


Algorithm 27.4: Persistent CD for training an RBM with binary hidden and visible units 

Initialize weights $W \in \mathbb{R}^{R \times K}$ randomly; 
Initialize chains $(v_s, h_s)_{s=1}^{S}$ randomly; 
for $t = 1, 2, \ldots$ do 
    // Mean field updates 
    for each case $i = 1 : N$ do 
        $\mu_{ik} = \mathrm{sigm}(v_i^T w_{:,k})$; 
    // MCMC updates 
    for each sample $s = 1 : S$ do 
        Generate $(v_s, h_s)$ by brief Gibbs sampling from old $(v_s, h_s)$; 
    // Parameter updates 
    $g = \frac{1}{N} \sum_{i=1}^{N} v_i (\mu_i)^T - \frac{1}{S} \sum_{s=1}^{S} v_s (h_s)^T$; 
    $W := W + \alpha g$; 
    Decrease $\alpha$ 
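The persistent-chain idea in Algorithm 27.4 can be sketched in NumPy as follows. This is an illustrative sketch only: bias terms are omitted, "brief Gibbs sampling" is taken to be a single sweep, the step size decays as lr0 / sqrt(t), and the function name and defaults are hypothetical choices.

```python
import numpy as np

def sigm(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def pcd_train(V, K, S=20, n_iters=300, lr0=0.1, seed=0):
    """Persistent CD for a bias-free binary RBM. V is (N, R) binary; returns W of shape (R, K)."""
    rng = np.random.default_rng(seed)
    N, R = V.shape
    W = 0.01 * rng.standard_normal((R, K))
    # Persistent chains, initialized randomly and never reset to the data.
    Vs = rng.integers(0, 2, size=(S, R)).astype(float)
    Hs = rng.integers(0, 2, size=(S, K)).astype(float)
    for t in range(1, n_iters + 1):
        # Clamped phase: mu_ik = sigm(v_i^T w_{:,k}) for every data case.
        mu = sigm(V @ W)
        # Unclamped phase: one Gibbs sweep continuing from the old chain states.
        Vs = (rng.random((S, R)) < sigm(Hs @ W.T)).astype(float)
        Hs = (rng.random((S, K)) < sigm(Vs @ W)).astype(float)
        # Parameter update with a decreasing step size (the schedule is an assumption).
        g = V.T @ mu / N - Vs.T @ Hs / S
        W += (lr0 / np.sqrt(t)) * g
    return W

rng = np.random.default_rng(1)
protos = np.array([[1, 1, 1, 0, 0, 0],
                   [0, 0, 0, 1, 1, 1]], dtype=float)
V = protos[rng.integers(0, 2, size=100)]
W = pcd_train(V, K=2)
```

The only structural difference from CD is that the negative-phase chains `Vs`, `Hs` carry over between parameter updates instead of being restarted at the data.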


Applications of RBMs 


The main application of RBMs is as a building block for deep generative models, which we 
discuss in Section 28.2. But they can also be used as substitutes for directed two-layer models. 
They are particularly useful in cases where inference of the hidden states at test time must be 
fast. We give some examples below. 


Language modeling and document retrieval 


We can use a categorical RBM to define a generative model for bag-of-words, as an alternative 
to LDA. One subtlety is that the partition function in an undirected model depends on how 
big the graph is, and therefore on how long the document is. A solution to this was proposed 
in (Salakhutdinov and Hinton 2010): use a categorical RBM with tied weights, but multiply the 
hidden activation bias terms $c_k$ by the document length $L$ to compensate for the fact that the 
observed word-count vector $v$ is larger in magnitude: 


$$E(v, h; \theta) \triangleq -\sum_{k=1}^{K} \sum_{c=1}^{C} v^c h_k W_k^c - \sum_{c=1}^{C} v^c b^c - L \sum_{k=1}^{K} h_k c_k \tag{27.129}$$
Data set    Number of docs        K         D       St. Dev.   Avg. Test perplexity per word (in nats) 
            Train      Test                                    LDA-50   LDA-200   R.Soft-50   Unigram 
NIPS        1,690      50         13,649    98.0    245.3      3576     3391      3405        4385 
20-news     11,314     7,531      2,000     51.8    70.8       1091     1058      953         1335 
Reuters     794,414    10,000     10,000    94.6    69.3       1437     1142      988         2208 


Figure 27.32 Comparison of RBM (replicated softmax) and LDA on three corpora. K is the number of 
words in the vocabulary, D is the average document length, and St. Dev. is the standard deviation of the 
document length. Source: (Salakhutdinov and Hinton 2010). 
Figure 27.33 Precision-recall curves for RBM (replicated softmax) and LDA on two corpora. From Figure 
3 of (Salakhutdinov and Hinton 2010). Used with kind permission of Ruslan Salakhutdinov. 


where $v^c = \sum_{l=1}^{L} I(y_{il} = c)$. This is like having a single multinomial node (so we have dropped 
the $r$ subscript) with $C$ states, where $C$ is the number of words in the vocabulary. This is 
called the replicated softmax model (Salakhutdinov and Hinton 2010), and is an undirected 
alternative to mPCA/LDA. 

We can compare the modeling power of RBMs vs LDA by measuring the perplexity on a test 
set. This can be approximated using annealed importance sampling (Section 24.6.2). The results 
are shown in Figure 27.32. We see that LDA is significantly better than a unigram model, 
but that an RBM is significantly better than LDA. 

Another advantage of the RBM is that inference is fast and exact: just a single matrix-vector 
multiply followed by a sigmoid nonlinearity, as in Equation 27.107. In addition to being faster, 
the RBM is more accurate. This is illustrated in Figure 27.33, which shows precision-recall curves 
for RBMs and LDA on two different corpora. These curves were generated as follows: a query 
document from the test set is taken, its similarity to all the training documents is computed, 
where the similarity is defined as the cosine of the angle between the two topic vectors, and 
then the top M documents are returned for varying M. A retrieved document is considered 
relevant if it has the same class label as the query (this is the only place where labels 
are used). 
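The retrieval protocol just described can be sketched as follows. This is a minimal illustration, assuming each document is already represented by a topic (hidden-unit) vector; the function name `precision_recall_at_m` and the four toy documents are hypothetical.

```python
import numpy as np

def precision_recall_at_m(query_vec, query_label, train_vecs, train_labels, M):
    """Rank training documents by cosine similarity to the query and score the top M."""
    sims = (train_vecs @ query_vec) / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(query_vec))
    top = np.argsort(-sims)[:M]                 # indices of the M most similar documents
    relevant = train_labels == query_label      # relevance = same class label as the query
    hits = np.sum(relevant[top])
    return hits / M, hits / np.sum(relevant)    # (precision, recall)

# Toy topic vectors for four training documents with two class labels.
train_vecs = np.array([[1.0, 0.0],
                       [0.9, 0.1],
                       [0.0, 1.0],
                       [0.1, 0.9]])
train_labels = np.array([0, 0, 1, 1])
prec, rec = precision_recall_at_m(np.array([1.0, 0.2]), 0,
                                  train_vecs, train_labels, M=2)
```

Sweeping M from 1 to the number of training documents, and averaging over queries, traces out a precision-recall curve of the kind shown in the figure.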
RBMs for collaborative filtering 


RBMs have been applied to the Netflix collaborative filtering competition (Salakhutdinov et al. 
2007). In fact, an RBM with binary hidden nodes and categorical visible nodes can slightly 
outperform SVD. By combining the two methods, performance can be further improved. (The 
winning entry in the challenge was an ensemble of many different types of model (Koren 2009a).) 


Exercises 


Exercise 27.1 Partition function for an RBM 


Show how to compute $Z(\theta)$ for an RBM with $K$ binary hidden nodes and $R$ binary observed nodes in 
$O(R 2^K)$ time, assuming $K < R$. 


Deep learning 


Introduction 


Many of the models we have looked at in this book have a simple two-layer architecture of 
the form z → y for unsupervised latent variable models, or x → y for supervised models. 
However, when we look at the brain, we see many levels of processing. It is believed that each 
level is learning features or representations at increasing levels of abstraction. For example, the 
standard model of the visual cortex (Hubel and Wiesel 1962; Serre et al. 2005; Ranzato et al. 
2007) suggests that (roughly speaking) the brain first extracts edges, then patches, then surfaces, 
then objects, etc. (See e.g., (Palmer 1999; Kandel et al. 2000) for more information about how 
the brain might perform vision.) 

This observation has inspired a recent trend in machine learning known as deep learning 
(see e.g., (Bengio 2009), deeplearning.net, and the references therein), which attempts to 
replicate this kind of architecture in a computer. (Note the idea can be applied to non-vision 
problems as well, such as speech and language.) 

In this chapter, we give a brief overview of this new field. However, we caution the reader 
that the topic of deep learning is currently evolving very quickly, so the material in this chapter 
may soon be outdated. 


Deep generative models 


Deep models often have millions of parameters. Acquiring enough labeled data to train such 
models is difficult, despite crowd sourcing sites such as Mechanical Turk. In simple settings, such 
as hand-written character recognition, it is possible to generate lots of labeled data by making 
modified copies of a small manually labeled training set (see e.g., Figure 16.13), but it seems 
unlikely that this approach will scale to complex scenes.¹ 

To overcome the problem of needing labeled training data, we will focus on unsupervised 
learning. The most natural way to perform this is to use generative models. In this section, we 
discuss three different kinds of deep generative models: directed, undirected, and mixed. 


1. There have been some attempts to use computer graphics and video games to generate realistic-looking images of 
complex scenes, and then to use this as training data for computer vision systems. However, often graphics programs 
cut corners in order to make perceptually appealing images which are not reflective of the natural statistics of real-world 
images. 


Figure 28.1 Some deep multi-layer graphical models. Observed variables are at the bottom. (a) A directed 
model. (b) An undirected model (deep Boltzmann machine). (c) A mixed directed-undirected model (deep 
belief net). 


Deep directed networks 


Perhaps the most natural way to build a deep generative model is to construct a deep directed 
graphical model, as shown in Figure 28.1(a). The bottom level contains the observed pixels (or 
whatever the data is), and the remaining layers are hidden. We have assumed just 3 layers for 
notational simplicity. The number and size of layers is usually chosen by hand, although one 
can also use non-parametric Bayesian methods (Adams et al. 2010) or boosting (Chen et al. 2010) 
to infer the model structure. 

We shall call models of this form deep directed networks or DDNs. If all the nodes are 
binary, and all CPDs are logistic functions, this is called a sigmoid belief net (Neal 1992). In 
this case, the model defines the following joint distribution: 


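\begin{align}
p(h_1, h_2, h_3, v \mid \theta) = {} & \prod_i \mathrm{Ber}(v_i \mid \mathrm{sigm}(h_1^T w_{0i})) \prod_j \mathrm{Ber}(h_{1j} \mid \mathrm{sigm}(h_2^T w_{1j})) \tag{28.1} \\
& \prod_k \mathrm{Ber}(h_{2k} \mid \mathrm{sigm}(h_3^T w_{2k})) \prod_l \mathrm{Ber}(h_{3l} \mid w_{3l}) \tag{28.2}
\end{align}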
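Because the model is directed, drawing a sample from this joint distribution only requires a single top-down (ancestral) pass. The sketch below illustrates this for a sigmoid belief net without bias terms; the function name `sample_sbn`, the layer sizes, and the random weights are assumptions made for illustration.

```python
import numpy as np

def sigm(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def sample_sbn(top_probs, weights, rng):
    """Ancestral sampling in a sigmoid belief net.

    top_probs: Bernoulli means of the top hidden layer (the role of w_3 above).
    weights:   matrices ordered from top to bottom; each has shape
               (size of parent layer, size of child layer).
    Returns the sampled layers ordered from top to bottom, ending with v.
    """
    h = (rng.random(top_probs.shape) < top_probs).astype(float)
    layers = [h]
    for W in weights:
        p = sigm(h @ W)                               # child means given the sampled parents
        h = (rng.random(p.shape) < p).astype(float)
        layers.append(h)
    return layers

rng = np.random.default_rng(0)
sizes = [2, 3, 4, 5]                                   # |h3|, |h2|, |h1|, |v| (hypothetical)
top_probs = np.full(sizes[0], 0.5)
weights = [rng.normal(size=(sizes[i], sizes[i + 1])) for i in range(3)]
h3, h2, h1, v = sample_sbn(top_probs, weights, rng)
```

Sampling is easy in this direction; it is the reverse problem, inferring the hidden layers given v, that is hard.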
k l 


Unfortunately, inference in directed models such as these is intractable because the posterior 
on the hidden nodes is correlated due to explaining away. One can use fast mean field 
approximations (Jaakkola and Jordan 1996a; Saul and Jordan 2000), but these may not be very accurate, 
since they approximate the correlated posterior with a factorial posterior. One can also use 
MCMC inference (Neal 1992; Adams et al. 2010), but this can be quite slow because the variables 
are highly correlated. Slow inference also results in slow learning. 


Deep Boltzmann machines 


A natural alternative to a directed model is to construct a deep undirected model. For example, 
we can stack a series of RBMs on top of each other, as shown in Figure 28.1(b). This is known 
as a deep Boltzmann machine or DBM (Salakhutdinov and Hinton 2009). If we have 3 hidden 
layers, the model is defined as follows: 


$$p(h_1, h_2, h_3, v \mid \theta) = \frac{1}{Z(\theta)} \exp\Big( \sum_{ij} v_i h_{1j} W_{1ij} + \sum_{jk} h_{1j} h_{2k} W_{2jk} + \sum_{kl} h_{2k} h_{3l} W_{3kl} \Big) \tag{28.3}$$

where we are ignoring constant offset or bias terms. 

The main advantage over the directed model is that one can perform efficient block (layer- 
wise) Gibbs sampling, or block mean field, since all the nodes in each layer are conditionally 
independent of each other given the layers above and below (Salakhutdinov and Larochelle 
2010). The main disadvantage is that training undirected models is more difficult, because of the 
partition function. However, below we will see a greedy layer-wise strategy for learning deep 
undirected models. 
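The layer-wise conditional independence can be made concrete with a short sketch of one block Gibbs sweep in a three-hidden-layer DBM with the visible layer clamped. This is an illustration under assumptions: bias terms are ignored, W1, W2, W3 have shapes (R, K1), (K1, K2), (K2, K3), and the function names and sizes are hypothetical.

```python
import numpy as np

def sigm(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli(p, rng):
    """Draw independent Bernoulli samples with means p."""
    return (rng.random(p.shape) < p).astype(float)

def dbm_gibbs_sweep(v, h2, W1, W2, W3, rng):
    """One sweep with v clamped: sample (h1, h3) given (v, h2), then h2 given (h1, h3)."""
    h1 = bernoulli(sigm(v @ W1 + W2 @ h2), rng)    # depends on the layers below and above
    h3 = bernoulli(sigm(h2 @ W3), rng)             # top layer only has a layer below
    h2 = bernoulli(sigm(h1 @ W2 + W3 @ h3), rng)
    return h1, h2, h3

rng = np.random.default_rng(0)
R, K1, K2, K3 = 5, 4, 3, 2                         # hypothetical layer sizes
W1 = rng.normal(scale=0.1, size=(R, K1))
W2 = rng.normal(scale=0.1, size=(K1, K2))
W3 = rng.normal(scale=0.1, size=(K2, K3))
v = rng.integers(0, 2, size=R).astype(float)
h2 = rng.integers(0, 2, size=K2).astype(float)
for _ in range(10):
    h1, h2, h3 = dbm_gibbs_sweep(v, h2, W1, W2, W3, rng)
```

Each conditional factorizes over the units within a layer, which is why a whole layer can be sampled with one matrix-vector product.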


Deep belief networks 


An interesting compromise is to use a model that is partially directed and partially undirected. 
In particular, suppose we construct a layered model which has directed arrows, except at the 
top, where there is an undirected bipartite graph, as shown in Figure 28.1(c). This model is 
known as a deep belief network (Hinton et al. 2006) or DBN.² If we have 3 hidden layers, the 
model is defined as follows: 


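\begin{align}
p(h_1, h_2, h_3, v \mid \theta) = {} & \prod_i \mathrm{Ber}(v_i \mid \mathrm{sigm}(h_1^T w_{1i})) \prod_j \mathrm{Ber}(h_{1j} \mid \mathrm{sigm}(h_2^T w_{2j})) \tag{28.4} \\
& \times \frac{1}{Z(\theta)} \exp\Big( \sum_{kl} h_{2k} h_{3l} W_{3kl} \Big) \tag{28.5}
\end{align}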


Essentially the top two layers act as an associative memory, and the remaining layers then 
generate the output. 

The advantage of this peculiar architecture is that we can infer the hidden states in a 
fast, bottom-up fashion. To see why, suppose we only have two hidden layers, and that 
$W_2 = W_1^T$, so the second level weights are tied to the first level weights (see Figure 28.2(a)). 
This defines a model of the form $p(h_1, h_2, v \mid W_1)$. One can show that the distribution 
$p(h_1, v \mid W_1) = \sum_{h_2} p(h_1, h_2, v \mid W_1)$ has the form $p(h_1, v \mid W_1) = \frac{1}{Z(W_1)} \exp(v^T W_1 h_1)$, 
which is equivalent to an RBM. Since the DBN is equivalent to the RBM as far as $p(h_1, v \mid W_1)$ 
is concerned, we can infer the posterior $p(h_1 \mid v, W_1)$ in the DBN exactly as in the RBM. This 
posterior is exact, even though it is fully factorized. 

Now the only way to get a factored posterior is if the prior $p(h_1 \mid W_1)$ is a complementary 
prior. This is a prior which, when multiplied by the likelihood $p(v \mid h_1)$, results in a perfectly 
factored posterior. Thus we see that the top level RBM in a DBN acts as a complementary prior 
for the bottom level directed sigmoidal likelihood function. 

If we have multiple hidden levels, and/or if the weights are not tied, the correspondence 
between the DBN and the RBM does not hold exactly any more, but we can still use the factored 
inference rule as a form of approximate bottom-up inference. Below we show that this is a valid 
variational lower bound. This bound also suggests a layer-wise training strategy, that we will 
explain in more detail later. Note, however, that top-down inference in a DBN is not tractable, 
so DBNs are usually only used in a feedforward manner. 


2. Unfortunately the acronym “DBN” also stands for “dynamic Bayes net” (Section 17.6.7). Geoff Hinton, who invented 
deep belief networks, has suggested the acronyms DeeBNs and DyBNs for these two different meanings. However, this 
terminology is non-standard. 
Figure 28.2 (a) A DBN with two hidden layers and tied weights that is equivalent to an RBM. Source: 
Figure 2.2 of (Salakhutdinov 2009). (b) A stack of RBMs trained greedily. (c) The corresponding DBN. 
Source: Figure 2.3 of (Salakhutdinov 2009). Used with kind permission of Ruslan Salakhutdinov. 


Greedy layer-wise learning of DBNs 
The equivalence between DBNs and RBMs suggests the following strategy for learning a DBN. 


• Fit an RBM to learn $W_1$ using methods described in Section 27.7.2. 

• Unroll the RBM into a DBN with 2 hidden layers, as in Figure 28.2(a). Now “freeze” the 
  directed weights $W_1$ and let $W_2$ be “untied” so it is no longer forced to be equal to $W_1^T$. 
  We will now learn a better prior for $p(h_1 \mid W_2)$ by fitting a second RBM. The input data to 
  this new RBM is the activation of the hidden units $E[h_1 \mid v, W_1]$ which can be computed 
  using a factorial approximation. 

• Continue to add more hidden layers until some stopping criterion is satisfied, e.g., you run 
  out of time or memory, or you start to overfit the validation set. Construct the DBN from 
  these RBMs, as illustrated in Figure 28.2(c). 
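The greedy strategy in the list above can be sketched as a loop that fits one RBM at a time and feeds the hidden-unit means upward. This is a rough illustration, not the procedure of (Hinton et al. 2006) in detail: the RBMs are bias-free, each is fitted with a few full-batch CD-1 steps, the real-valued means $E[h \mid v]$ are used directly as training data for the next layer, and all names and sizes are hypothetical.

```python
import numpy as np

def sigm(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def fit_rbm_cd1(X, K, n_epochs=10, lr=0.1, rng=None):
    """Fit one bias-free RBM with full-batch CD-1; X is (N, R) with entries in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    N, R = X.shape
    W = 0.01 * rng.standard_normal((R, K))
    for _ in range(n_epochs):
        mu = sigm(X @ W)                                        # E[h | input]
        H = (rng.random(mu.shape) < mu).astype(float)           # sampled hidden states
        Xp = (rng.random(X.shape) < sigm(H @ W.T)).astype(float)  # one-step reconstructions
        W += (lr / N) * (X.T @ mu - Xp.T @ sigm(Xp @ W))
    return W

def greedy_pretrain(V, layer_sizes, rng):
    """Train a stack of RBMs; each new RBM sees E[h | input] from the frozen layer below."""
    weights, X = [], V
    for K in layer_sizes:
        W = fit_rbm_cd1(X, K, rng=rng)
        weights.append(W)          # this layer's weights are now frozen
        X = sigm(X @ W)            # factorial posterior means become the next layer's data
    return weights

rng = np.random.default_rng(0)
V = rng.integers(0, 2, size=(100, 8)).astype(float)
weights = greedy_pretrain(V, layer_sizes=[6, 4, 3], rng=rng)
```

In practice the loop would stop according to a criterion such as validation performance rather than a fixed list of layer sizes.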


One can show (Hinton et al. 2006) that this procedure always increases a lower bound on the 
observed data likelihood. Of course this procedure might result in overfitting, but that is a 
different matter. 

In practice, we want to be able to use any number of hidden units in each level. This means 
we will not be able to initialize the weights so that $W_\ell = W_{\ell-1}^T$. This voids the theoretical 
guarantee. Nevertheless the method works well in practice, as we will see. The method can also 
be extended to train DBMs in a greedy way (Salakhutdinov and Larochelle 2010). 

After using the greedy layer-wise training strategy, it is standard to “fine tune” the weights, 
using a technique called backfitting. This works as follows. Perform an upwards sampling pass 
to the top. Then perform brief Gibbs sampling in the top level RBM, and perform a CD update 
of the RBM parameters. Finally, perform a downwards ancestral sampling pass (which is an 
approximate sample from the posterior), and update the logistic CPD parameters using a small 
gradient step. This is called the up-down procedure (Hinton et al. 2006). Unfortunately this 
procedure is very slow. 
Deep neural networks 


Given that DBNs are often only used in a feed-forward, or bottom-up, mode, they are effectively 
acting like neural networks. In view of this, it is natural to dispense with the generative story 
and try to fit deep neural networks directly, as we discuss below. The resulting training methods 
are often simpler to implement, and can be faster. 

Note, however, that performance with deep neural nets is sometimes not as good as with 
probabilistic models (Bengio et al. 2007). One reason for this is that probabilistic models support 
top-down inference as well as bottom-up inference. (DBNs do not support efficient top-down 
inference, but DBMs do, and this has been shown to help (Salakhutdinov and Larochelle 2010).) 
Top-down inference is useful when there is a lot of ambiguity about the correct interpretation 
of the signal. 

It is interesting to note that in the mammalian visual cortex, there are many more feedback 
connections than there are feedforward connections (see e.g., (Palmer 1999; Kandel et al. 2000)). 
The role of these feedback connections is not precisely understood, but they presumably provide 
contextual prior information (e.g., coming from the previous “frame” or retinal glance) which 
can be used to disambiguate the current bottom-up signals (Lee and Mumford 2003). 

Of course, we can simulate the effect of top-down inference using a neural network. However 
the models we discuss below do not do this. 


Deep multi-layer perceptrons 


Many decision problems can be reduced to classification, e.g., predict which object (if any) is 
present in an image patch, or predict which phoneme is present in a given acoustic feature 
vector. We can solve such problems by creating a deep feedforward neural network or multi- 
layer perceptron (MLP), as in Section 16.5, and then fitting the parameters using gradient descent 
(aka back-propagation). 

Unfortunately, this method does not work very well. One problem is that the gradient becomes 
weaker the further we move away from the data; this is known as the “vanishing gradient” 
problem (Bengio and Frasconi 1995). A related problem is that there can be large plateaus in 
the error surface, which cause simple first-order gradient-based methods to get stuck (Glorot and 
Bengio 2010). 

Consequently early attempts to learn deep neural networks proved unsuccessful. Recently there 
has been some progress, due to the adoption of GPUs (Ciresan et al. 2010) and second-order 
optimization algorithms (Martens 2010). Nevertheless, such models remain difficult to train. 

Below we discuss a way to initialize the parameters using unsupervised learning; this is called 
generative pre-training. The advantage of performing unsupervised learning first is that the 
model is forced to model a high-dimensional response, namely the input feature vector, rather 
than just predicting a scalar response. This acts like a data-induced regularizer, and helps 
backpropagation find local minima with good generalization properties (Erhan et al. 2010; Glorot 
and Bengio 2010). 
Figure 28.3 Training a deep autoencoder. (a) Pretraining: first we greedily train some RBMs. (b) Unrolling: 
then we construct the auto-encoder by replicating the weights. (c) Fine-tuning: finally we fine-tune the 
weights using back-propagation. From Figure 1 of (Hinton and Salakhutdinov 2006). Used with kind 
permission of Ruslan Salakhutdinov. 


Deep auto-encoders 


An auto-encoder is a kind of unsupervised neural network that is used for dimensionality 
reduction and feature discovery. More precisely, an auto-encoder is a feedforward neural network 
that is trained to predict the input itself. To prevent the system from learning the trivial identity 
mapping, the hidden layer in the middle is usually constrained to be a narrow bottleneck. The 
system can minimize the reconstruction error by ensuring the hidden units capture the most 
relevant aspects of the data. 

Suppose the system has one hidden layer, so the model has the form v → h → v. Further, 
suppose all the functions are linear. In this case, one can show that the weights to the K 
hidden units will span the same subspace as the first K principal components of the data 
(Karhunen and Joutsensalo 1995; Japkowicz et al. 2000). In other words, linear auto-encoders are 
equivalent to PCA. However, by using nonlinear activation functions, one can discover nonlinear 
representations of the data. 

More powerful representations can be learned by using deep auto-encoders. Unfortunately 
training such models using back-propagation does not work well, because the gradient signal 
becomes too small as it passes back through multiple layers, and the learning algorithm often 
gets stuck in poor local minima. 

One solution to this problem is to greedily train a series of RBMs and to use these to initialize 
an auto-encoder, as illustrated in Figure 28.3. The whole system can then be fine-tuned using 
backprop in the usual fashion. This approach, first suggested in (Hinton and Salakhutdinov 
2006), works much better than trying to fit the deep auto-encoder directly starting with random 
weights. 


Figure 28.4 (a) A DBN architecture for classifying MNIST digits. Source: Figure 1 of (Hinton et al. 2006). 
Used with kind permission of Geoff Hinton. (b) These are the 125 errors made by the DBN on the 10,000 
test cases of MNIST. Above each image is the estimated label. Source: Figure 6 of (Hinton et al. 2006). 
Used with kind permission of Geoff Hinton. Compare to Figure 16.15. 


Stacked denoising auto-encoders 


A standard way to train an auto-encoder is to ensure that the hidden layer is narrower than the 
visible layer. This prevents the model from learning the identity function. But there are other 
ways to prevent this trivial solution, which allow for the use of an over-complete representation. 
One approach is to impose sparsity constraints on the activation of the hidden units (Ranzato 
et al. 2006). Another approach is to add noise to the inputs; this is called a denoising 
auto-encoder (Vincent et al. 2010). For example, we can corrupt some of the inputs, say 
by setting them to zero, so the model has to learn to predict the missing entries. This can be 
shown to be equivalent to a certain approximate form of maximum likelihood training (known 
as score matching) applied to an RBM (Vincent 2011). 
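The corruption step of a denoising auto-encoder is simple to write down. The sketch below shows only that step, using masking noise (each entry is zeroed independently); the function name `corrupt` and the drop probability of 0.3 are illustrative assumptions, and the auto-encoder itself is not shown.

```python
import numpy as np

def corrupt(X, drop_prob, rng):
    """Masking noise: set each input entry to zero independently with probability drop_prob."""
    keep = rng.random(X.shape) >= drop_prob
    return X * keep

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(4, 5)).astype(float)   # clean inputs (toy data)
X_noisy = corrupt(X, drop_prob=0.3, rng=rng)

# Training pairs for the denoising auto-encoder: the network receives X_noisy as
# input and is trained to reconstruct the clean X, so it must predict zeroed entries.
inputs, targets = X_noisy, X
```

A fresh corruption is normally drawn each time an example is presented, so the model never sees exactly the same noisy input twice.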

Of course, we can stack these models on top of each other to learn a deep stacked denoising 
auto-encoder, which can be discriminatively fine-tuned just like a feedforward neural network, 
if desired. 


Applications of deep networks 


In this section, we mention a few applications of the models we have been discussing. 


Handwritten digit classification using DBNs 


Figure 28.4(a) shows a DBN (from (Hinton et al. 2006)) consisting of 3 hidden layers. The visible 
layer corresponds to binary images of handwritten digits from the MNIST data set. In addition, 
the top RBM is connected to a softmax layer with 10 units, representing the class label. 
Figure 28.5 2d visualization of some bag of words data from the Reuters RCV1-v2 corpus. (a) Results of 
using LSA. (b) Results of using a deep auto-encoder. Source: Figure 4 of (Hinton and Salakhutdinov 2006). 
Used with kind permission of Ruslan Salakhutdinov. 


The first 2 hidden layers were trained in a greedy unsupervised fashion from 50,000 MNIST 
digits, using 30 epochs (passes over the data) and stochastic gradient descent, with the CD 
heuristic. This process took “a few hours per layer” (Hinton et al. 2006, p1540). Then the top 
layer was trained using as input the activations of the lower hidden layer, as well as the class 
labels. The corresponding generative model had a test error of about 2.5%. The network weights 
were then carefully fine-tuned on all 60,000 training images using the up-down procedure. This 
process took “about a week” (Hinton et al. 2006, p1540). The model can be used to classify 
by performing a deterministic bottom-up pass, and then computing the free energy for the 
top-level RBM for each possible class label. The final error on the test set was about 1.25%. The 
misclassified examples are shown in Figure 28.4(b). 

This was the best error rate of any method on the permutation-invariant version of MNIST 
at that time. (By permutation-invariant, we mean a method that does not exploit the fact that 
the input is an image. Generic methods work just as well on permuted versions of the input 
(see Figure 1.5), and can therefore be applied to other kinds of datasets.) The only other method 
that comes close is an SVM with a degree 9 polynomial kernel, which has achieved an error 
rate of 1.4% (Decoste and Schoelkopf 2002). By way of comparison, 1-nearest neighbor (using 
all 60,000 examples) achieves 3.1% (see mnist1NNdemo). This is not as good, although 1-NN is 
much simpler.³ 


Data visualization and feature discovery using deep auto-encoders 


Deep autoencoders can learn informative features from raw data. Such features are often used 
as input to standard supervised learning methods. 
To illustrate this, consider fitting a deep auto-encoder with a 2d hidden bottleneck to some 
text data. The results are shown in Figure 28.5. On the left we show the 2d embedding produced 
by LSA (Section 27.2.2), and on the right, the 2d embedding produced by the auto-encoder. It is 
clear that the low-dimensional representation created by the auto-encoder has captured a lot of 
the meaning of the documents, even though class labels were not used.⁴ 


3. One can get much improved performance on this task by exploiting the fact that the input is an image. One way to do 
this is to create distorted versions of the input, adding small shifts and translations (see Figure 16.13 for some examples). 
Applying this trick reduced the SVM error rate to 0.56%. Similar error rates can be achieved using convolutional neural 
networks (Section 16.5.1) trained on distorted images ((Simard et al. 2003) got 0.4%). However, the point of DBNs is that 
they offer a way to learn such prior knowledge, without it having to be hand-crafted. 


Figure 28.6 Precision-recall curves for document retrieval in the Reuters RCV1-v2 corpus. Source: Figure 
3.9 of (Salakhutdinov 2009). Used with kind permission of Ruslan Salakhutdinov. 

Note that various other ways of learning low-dimensional continuous embeddings of words 
have been proposed. See e.g., (Turian et al. 2010) for details. 


Information retrieval using deep auto-encoders (semantic hashing) 


In view of the success of RBMs for information retrieval discussed in Section 27.7.3.1, it is natural 
to wonder if deep models can do even better. In fact they can, as is shown in Figure 28.6. 
More interestingly, we can use a binary low-dimensional representation in the middle layer 
of the deep auto-encoder, rather than a continuous representation as we used above. This 
enables very fast retrieval of related documents. For example, if we use a 20-bit code, we 
can precompute the binary representation for all the documents, and then create a hash-table 
mapping codewords to documents. This approach is known as semantic hashing, since the 
binary representation of semantically similar documents will be close in Hamming distance. 
For the 402,207 test documents in Reuters RCV1-v2, this results in about 0.4 documents per 
entry in the table. At test time, we compute the codeword for the query, and then simply retrieve 
the relevant documents in constant time by looking up the contents of the relevant address in 
memory. To find other related documents, we can compute all the codewords within a 


4. Some details. Salakhutdinov and Hinton used the Reuters RCV1-v2 data set, which consists of 804,414 newswire 
articles, manually classified into 103 topics. They represent each document by counting how many times each of the top 
2000 most frequent words occurs. They trained a deep auto-encoder with 2000-500-250-125-2 layers on half of the data. 
The 2000 visible units use a replicated softmax distribution, the 2 hidden units in the middle layer have a Gaussian 
distribution, and the remaining units have the usual Bernoulli-logistic distribution. When fine tuning the auto-encoder, 
a cross-entropy loss function (equivalent to maximum likelihood under a multinoulli distribution) was used. See (Hinton 
and Salakhutdinov 2006) for further details. 

Figure 28.7 A small 1d convolutional RBM with two groups of hidden units, each associated with a filter 
of size 2. $h_1^1$ and $h_1^2$ are two different “views” of the data in the first window, $(x_1, x_2)$. The first view is 
computed using the filter $w^1$, the second view using filter $w^2$. Similarly, $h_2^1$ and $h_2^2$ are the views of the 
data in the second window, $(x_2, x_3)$, computed using $w^1$ and $w^2$ respectively. 


Hamming distance of, say, 4. This results in retrieving about 6196 × 0.4 ≈ 2500 documents⁵. 
The key point is that the total time is independent of the size of the corpus. 
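
The lookup described above can be sketched in a few lines. This is only an illustration, assuming the 20-bit codes have already been produced by some trained auto-encoder and are stored as plain integers; the function names (`build_table`, `hamming_ball`, `retrieve`) are hypothetical and not taken from the cited work. The sketch also confirms the count of addresses probed at radius 4, and the load factor 402,207 / 2^20 ≈ 0.38 matches the "about 0.4 documents per entry" quoted in the text.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

N_BITS = 20  # code length used in the text


def build_table(codes):
    """Map each integer codeword to the list of document ids that share it."""
    table = defaultdict(list)
    for doc_id, code in enumerate(codes):
        table[code].append(doc_id)
    return table


def hamming_ball(code, radius, n_bits=N_BITS):
    """Yield every codeword within the given Hamming distance of `code`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p  # flip bit p
            yield flipped


def retrieve(table, query_code, radius=4):
    """Collect documents stored at any address within `radius` bits of the query."""
    hits = []
    for code in hamming_ball(query_code, radius):
        hits.extend(table.get(code, []))
    return hits


# Number of addresses probed at radius 4: sum over k = 0..4 of (20 choose k).
ball_size = sum(comb(N_BITS, k) for k in range(5))
```

The number of probes depends only on the code length and the radius, not on how many documents were inserted into the table, which is the sense in which retrieval time is independent of corpus size.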

Of course, there are other techniques for fast document retrieval, such as inverted indices. 
These rely on the fact that individual words are quite informative, so we can simply intersect all 
the documents that contain each word. However, when performing image retrieval, it is clear that 
we do not want to work at the pixel level. Recently (Krizhevsky and Hinton 2010) showed that 
a deep autoencoder could learn a good semantic hashing function that outperformed previous 
techniques (Torralba et al. 2008; Weiss et al. 2008) on the 80 million tiny images dataset. It 
is hard to apply inverted indexing techniques to real-valued data (although one could imagine 
vector quantizing image patches). 


Learning audio features using 1d convolutional DBNs 


To apply DBNs to time series of unbounded length, it is necessary to use some form of parameter 
tying. One way to do this is to use convolutional DBNs (Lee et al. 2009; Desjardins and Bengio 
2008), which use convolutional RBMs as their basic unit. These models are a generative version of 
convolutional neural nets discussed in Section 16.5.1. The basic idea is illustrated in Figure 28.7. 
The hidden activation vector for each group is computed by convolving the input vector with 
that group’s filter (weight vector or matrix). In other words, each node within a hidden group 
is a weighted combination of a subset of the inputs. We compute the activation of all the 
hidden nodes by “sliding” this weight vector over the input. This allows us to model translation 
invariance, since we use the same weights no matter where in the input vector the pattern 
occurs.⁶ Each group has its own filter, corresponding to its own pattern detector. 


5. Note that $6196 = \sum_{k=0}^{4} \binom{20}{k}$ is the number of bit vectors that are up to a Hamming distance of 4 away. 


6. It is often said that the goal of deep learning is to discover invariant features, e.g., a representation of an object 
that does not change even as nuisance variables, such as the lighting, do change. However, sometimes these so-called 
“nuisance variables” may be the variables of interest. For example if the task is to determine if a photograph was taken 
in the morning or the evening, then lighting is one of the more salient features, and object identity may be less relevant. 
As always, one task’s “signal” is another task’s “noise”, so it is unwise to “throw away” apparently irrelevant information 

More formally, for binary 1d signals, we can define the full conditionals in a convolutional 
RBM as follows (Lee et al. 2009): 


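$$p(h_t^k = 1 \mid \mathbf{v}) = \mathrm{sigm}\left((\mathbf{w}^k \otimes \mathbf{v})_t + b_t\right) \tag{28.6}$$
$$p(v_s = 1 \mid \mathbf{h}) = \mathrm{sigm}\left(\sum_k (\mathbf{w}^k \otimes \mathbf{h}^k)_s + c_s\right) \tag{28.7}$$


where $\mathbf{w}^k$ is the weight vector for group $k$, $b_t$ and $c_s$ are bias terms, and $\mathbf{a} \otimes \mathbf{b}$ represents the 
convolution of vectors $\mathbf{a}$ and $\mathbf{b}$. 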
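
As a rough illustration of the two conditionals above, the following NumPy sketch uses the sizes from Figure 28.7 (three visible units, two groups, filters of size 2). Several choices here are assumptions rather than details from the text: the bottom-up pass slides the filter without flipping it (`np.correlate` in "valid" mode), the top-down pass uses a "full" convolution so that the two passes are adjoint to each other, each group has a single scalar bias, and the visible units share one bias. The filter values are made up.

```python
import numpy as np


def sigm(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))


def hidden_probs(v, filters, b):
    """p(h^k_t = 1 | v) for each group k: slide filter k over v (valid positions only)."""
    return [sigm(np.correlate(v, w, mode="valid") + b_k) for w, b_k in zip(filters, b)]


def visible_probs(h, filters, c):
    """p(v_s = 1 | h): sum each group's top-down contribution, then squash."""
    total = sum(np.convolve(h_k, w, mode="full") for h_k, w in zip(h, filters))
    return sigm(total + c)


# Sizes from Figure 28.7: 3 visible units, 2 groups, filters of size 2.
v = np.array([1.0, 0.0, 1.0])
filters = [np.array([1.0, -1.0]), np.array([0.5, 0.5])]  # hypothetical weights
b = [0.0, 0.0]  # one bias per group (assumed)
c = 0.0         # one shared visible bias (assumed)

p_h = hidden_probs(v, filters, b)
rng = np.random.default_rng(0)
h = [(rng.random(p.shape) < p).astype(float) for p in p_h]  # sample binary hiddens
p_v = visible_probs(h, filters, c)
```

Because every position within a group reuses the same filter, the number of weights is independent of the input length, which is the parameter tying that makes the model applicable to sequences of arbitrary length.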

It is common to add a max pooling layer as well as a convolutional layer, which computes 
a local maximum over the filtered response. This allows for a small amount of translation 
invariance. It also reduces the size of the higher levels, which speeds up computation consider- 
ably. Defining this for a neural network is simple, but defining this in a way which allows for 
information flow backwards as well as forwards is a bit more involved. The basic idea is similar 
to a noisy-OR CPD (Section 10.2.3), where we define a probabilistic relationship between the max 
node and the parts it is maxing over. See (Lee et al. 2009) for details. Note, however, that the 
top-down generative process will be difficult, since the max pooling operation throws away so 
much information. 
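
For the simple feed-forward (neural network) case mentioned above, max pooling can be sketched as follows. This is the deterministic version only, with non-overlapping windows assumed; it is not the probabilistic max pooling of (Lee et al. 2009), and the function name `max_pool_1d` is hypothetical.

```python
import numpy as np


def max_pool_1d(response, width):
    """Non-overlapping max pooling over a 1d filtered response.

    Any ragged tail shorter than `width` is dropped.
    """
    n = (len(response) // width) * width
    return response[:n].reshape(-1, width).max(axis=1)


# A peak that moves within its pooling window leaves the pooled output unchanged.
a = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0])
shifted = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 3.0, 0.0])
pooled_a = max_pool_1d(a, 4)
pooled_shifted = max_pool_1d(shifted, 4)
```

The pooled output is a quarter of the input length here, and it no longer records where inside each window the maximum occurred, which is the information loss that makes the top-down pass difficult.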

(Lee et al. 2009) applies 1d convolutional DBNs of depth 2 to auditory data. When the input 
consists of speech signals, the method recovers a representation that is similar to phonemes. 
When applied to music classification and speaker identification, their method outperforms tech- 
niques using standard features such as MFCC. (All features were fed into the same discriminative 
classifier.) 

In (Seide et al. 2011), a deep neural net was used in place of a GMM inside a conventional 
HMM. The use of DNNs significantly improved performance on conversational speech recognition. 
In an interview, the tech lead of this project said “historically, there have been very few 
individual technologies in speech recognition that have led to improvements of this magnitude”.⁷ 


Learning image features using 2d convolutional DBNs 


We can extend a convolutional DBN from 1d to 2d in a straightforward way (Lee et al. 2009), as 
illustrated in Figure 28.8. The results of a 3 layer system trained on four classes of visual objects 
(cars, motorbikes, faces and airplanes) from the Caltech 101 dataset are shown in Figure 28.9. 
We only show the results for layers 2 and 3, because layer 1 learns Gabor-like filters that are 
very similar to those learned by sparse coding, shown in Figure 13.21(b). We see that layer 2 has 
learned some generic visual parts that are shared amongst object classes, and layer 3 seems to 
have learned filters that look like grandmother cells, that are specific to individual object classes, 
and in some cases, to individual objects. 


Discussion 


So far, we have been discussing models inspired by low-level processing in the brain. These 
models have produced useful features for simple classification tasks. But can this pure bottom-up 


too early. 
7. Source: http://research.microsoft.com/en-us/news/features/speechrecognition-082911.aspx. 

Figure 28.8 A 2d convolutional RBM with max-pooling layers. The input signal is a stack of 2d images 
(e.g., color planes). Each input layer is passed through a different set of filters. Each hidden unit is 
obtained by convolving with the appropriate filter, and then summing over the input planes. The final layer 
is obtained by computing the local maximum within a small window. Source: Figure 1 of (Chen et al. 
2010). Used with kind permission of Bo Chen. 


faces, cars, airplanes, motorbikes 


Figure 28.9 Visualization of the filters learned by a convolutional DBN in layers two and three. Source: 
Figure 3 of (Lee et al. 2009). Used with kind permission of Honglak Lee. 


approach scale to more challenging problems, such as scene interpretation or natural language 
understanding? 

To put the problem in perspective, consider the DBN for handwritten digit classification in 
Figure 28.4(a). This has about 1.6M free parameters (28 × 28 × 500 + 500 × 500 + 510 × 2000 = 
1,662,000). Although this is a lot, it is tiny compared to the number of neurons in the brain. 
As Hinton says, 


This is about as many parameters as 0.002 cubic millimetres of mouse cortex, and several 
hundred networks of this complexity could fit within a single voxel of a high-resolution 
fMRI scan. This suggests that much bigger networks may be required to compete with 
human shape recognition abilities. — (Hinton et al. 2006, p1547). 
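
The parameter count quoted above for the digit-classification DBN can be checked directly. The sketch counts weights only (no biases), and the layer sizes are read off the sum in the text; the interpretation of 510 as 500 feature units plus 10 label units is an assumption.

```python
# (inputs, outputs) for each weight matrix, taken from the sum quoted in the text
layers = [
    (28 * 28, 500),  # pixels -> first hidden layer
    (500, 500),      # first -> second hidden layer
    (510, 2000),     # second hidden layer plus (assumed) 10 label units -> top layer
]
n_params = sum(n_in * n_out for n_in, n_out in layers)
```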


To scale up to more challenging problems, various groups are using GPUs (see e.g., (Raina 
et al. 2009)) and/or parallel computing. But perhaps a more efficient approach is to work at a 
higher level of abstraction, where inference is done in the space of objects or their parts, rather 
than in the space of bits and pixels. That is, we want to bridge the signal-to-symbol divide, 
where by “symbol” we mean something atomic, that can be combined with other symbols in a 
compositional way. 

The question of how to convert low level signals into a more structured/“semantic” representation 
is known as the symbol grounding problem (Harnard 1990). Traditionally such symbols 
are associated with words in natural language, but it seems unlikely we can jump directly from 
low-level signals to high-level semantic concepts. Instead, what we need is an intermediate level 
of symbolic or atomic parts. 

A very simple way to create such parts from real-valued signals, such as images, is to apply 
vector quantization. This generates a set of visual words. These can then be modelled using 
some of the techniques from Chapter 27 for modeling bags of words. Such models, however, are 
still quite “shallow”. 
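
A minimal sketch of the vector quantization step is shown below. It assumes a codebook is already available (in practice it would typically be learned, e.g., with K-means); the tiny 2d "patches" and the codebook entries are made-up values, and the function names are hypothetical. Each patch is replaced by the index of its nearest codebook entry, and the resulting counts play the role of a bag-of-words vector.

```python
import numpy as np


def quantize(patches, codebook):
    """Index of the nearest codebook entry (squared Euclidean) for each patch."""
    d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def bag_of_visual_words(patches, codebook):
    """Histogram of visual-word counts for one image."""
    return np.bincount(quantize(patches, codebook), minlength=len(codebook))


codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])  # 3 visual words
patches = np.array([[0.1, -0.1], [0.9, 1.2], [4.0, 6.0], [1.1, 0.8]])
words = quantize(patches, codebook)
counts = bag_of_visual_words(patches, codebook)
```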

It is possible to define, and learn, deep models which use discrete latent parts. Here we just 
mention a few recent approaches, to give a flavor of the possibilities. (Salakhutdinov et al. 2011) 
combine RBMs with hierarchical latent Dirichlet allocation methods, trained in an unsupervised 
way. (Zhu et al. 2010) use latent and-or graphs, trained in a manner similar to a latent structural 
SVM. A similar approach, based on grammars, is described in (Girshick et al. 2011). What is 
interesting about these techniques is that they apply data-driven machine learning methods 
to rich structured/symbolic “AI-style” models. This seems like a promising future direction for 
machine learning. 


Notation 


Introduction 


It is very difficult to come up with a single, consistent notation to cover the wide variety of 
data, models and algorithms that we discuss. Furthermore, conventions differ between machine 
learning and statistics, and between different books and papers. Nevertheless, we have tried 
to be as consistent as possible. Below we summarize most of the notation used in this book, 
although individual sections may introduce new notation. Note also that the same symbol may 
have different meanings depending on the context, although we try to avoid this where possible. 


General math notation 


Symbol    Meaning

$\lfloor x \rfloor$    Floor of x, i.e., round down to nearest integer
$\lceil x \rceil$    Ceiling of x, i.e., round up to nearest integer
$\mathbf{x} \otimes \mathbf{y}$    Convolution of x and y
$\mathbf{x} \odot \mathbf{y}$    Hadamard (elementwise) product of x and y
$a \wedge b$    logical AND
$a \vee b$    logical OR
$\neg a$    logical NOT
$\mathbb{I}(x)$    Indicator function, I(x) = 1 if x is true, else I(x) = 0
$\infty$    Infinity
$\rightarrow$    Tends towards, e.g., $n \rightarrow \infty$
$\propto$    Proportional to, so y = ax can be written as $y \propto x$
$|x|$    Absolute value
$|\mathcal{S}|$    Size (cardinality) of a set
$n!$    Factorial function
$\nabla$    Vector of first derivatives
$\nabla^2$    Hessian matrix of second derivatives
$\triangleq$    Defined as
$O(\cdot)$    Big-O: roughly means order of magnitude
$\mathbb{R}$    The real numbers
$1:n$    Range (Matlab convention): 1 : n = {1,2,...,n}
$\approx$    Approximately equal to
$\arg\max_x f(x)$    Argmax: the value x that maximizes f
$B(a,b)$    Beta function, $B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$
$B(\boldsymbol{\alpha})$    Multivariate beta function, $\frac{\prod_k \Gamma(\alpha_k)}{\Gamma(\sum_k \alpha_k)}$
$\binom{n}{k}$    n choose k, equal to n!/(k!(n − k)!)
$\delta(x)$    Dirac delta function, $\delta(x) = \infty$ if x = 0, else $\delta(x) = 0$
$\delta_{ij}$    Kronecker delta, equals 1 if i = j, otherwise equals 0
$\delta_x(y)$    Kronecker delta, equals 1 if x = y, otherwise equals 0
exp(x)    Exponential function $e^x$
$\Gamma(x)$    Gamma function, $\Gamma(x) = \int_0^\infty u^{x-1} e^{-u} \, du$
$\Psi(x)$    Digamma function, $\Psi(x) = \frac{d}{dx} \log \Gamma(x)$
$\mathcal{X}$    A set from which values are drawn (e.g., $\mathcal{X} = \mathbb{R}^D$)


Linear algebra notation 


We use boldface lowercase to denote vectors, such as a, and boldface uppercase to denote 
matrices, such as A. Vectors are assumed to be column vectors, unless noted otherwise. 


Symbol    Meaning

$\mathbf{A} \succ 0$    A is a positive definite matrix
tr(A)    Trace of a matrix
det(A)    Determinant of matrix A
$|\mathbf{A}|$    Determinant of matrix A
$\mathbf{A}^{-1}$    Inverse of a matrix
$\mathbf{A}^{\dagger}$    Pseudo-inverse of a matrix
$\mathbf{A}^T$    Transpose of a matrix
$\mathbf{a}^T$    Transpose of a vector
diag(a)    Diagonal matrix made from vector a
diag(A)    Diagonal vector extracted from matrix A
$\mathbf{I}$ or $\mathbf{I}_d$    Identity matrix of size d × d (ones on diagonal, zeros off)
$\mathbf{1}$ or $\mathbf{1}_d$    Vector of ones (of length d)
$\mathbf{0}$ or $\mathbf{0}_d$    Vector of zeros (of length d)
$\|\mathbf{x}\| = \|\mathbf{x}\|_2$    Euclidean or $\ell_2$ norm $\sqrt{\sum_{j=1}^d x_j^2}$
$\|\mathbf{x}\|_1$    $\ell_1$ norm $\sum_{j=1}^d |x_j|$
$\mathbf{A}_{:,j}$    j'th column of matrix
$\mathbf{A}_{i,:}$    transpose of i'th row of matrix (a column vector)
$A_{ij}$    Element (i, j) of matrix A
$\mathbf{x} \otimes \mathbf{y}$    Tensor product of x and y


Probability notation 


We denote random and fixed scalars by lower case, random and fixed vectors by bold lower case, 
and random and fixed matrices by bold upper case. Occasionally we use non-bold upper case 
to denote scalar random variables. Also, we use p() for both discrete and continuous random 
variables. 

Symbol    Meaning

$X \perp Y$    X is independent of Y
$X \not\perp Y$    X is not independent of Y
$X \perp Y \mid Z$    X is conditionally independent of Y given Z
$X \not\perp Y \mid Z$    X is not conditionally independent of Y given Z
$X \sim p$    X is distributed according to distribution p
$\boldsymbol{\alpha}$    Parameters of a Beta or Dirichlet distribution
cov[x]    Covariance of x
$\mathbb{E}[X]$    Expected value of X
$\mathbb{E}_q[X]$    Expected value of X wrt distribution q
$\mathbb{H}(X)$ or $\mathbb{H}(p)$    Entropy of distribution p(X)
$\mathbb{I}(X;Y)$    Mutual information between X and Y
$\mathbb{KL}(p \| q)$    KL divergence from distribution p to q
$\ell(\boldsymbol{\theta})$    Log-likelihood function
$L(\theta, a)$    Loss function for taking action a when true state of nature is $\theta$
$\lambda$    Precision (inverse variance) $\lambda = 1/\sigma^2$
$\boldsymbol{\Lambda}$    Precision matrix $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$
mode[X]    Most probable value of X
$\mu$    Mean of a scalar distribution
$\boldsymbol{\mu}$    Mean of a multivariate distribution
p(x)    Probability density or mass function
p(x|y)    Conditional probability density of x given y
$\Phi$    cdf of standard normal
$\phi$    pdf of standard normal
$\boldsymbol{\pi}$    multinomial parameter vector, Stationary distribution of Markov chain
$\rho$    Correlation coefficient
sigm(x)    Sigmoid (logistic) function, $\frac{1}{1+e^{-x}}$
$\sigma^2$    Variance
$\boldsymbol{\Sigma}$    Covariance matrix
var[x]    Variance of x
$\nu$    Degrees of freedom parameter
Z    Normalization constant of a probability distribution


Machine learning/statistics notation 


In general, we use upper case letters to denote constants, such as C, D, K, N, S, T, etc. We 
use lower case letters as dummy indexes of the appropriate range, such as c = 1 : C to index 
classes, j = 1 : D to index input features, k = 1 : K to index states or clusters, s = 1 : S to 
index samples, t = 1 : T to index time, etc. To index data cases, we use the notation i = 1 : N, 
although the notation n = 1 : N is also widely used. 

We use $\mathbf{x}$ to represent an observed data vector. In a supervised problem, we use $y$ or $\mathbf{y}$ to 
represent the desired output label. We use $\mathbf{z}$ to represent a hidden variable. Sometimes we also 
use $q$ to represent a hidden discrete variable. 

Symbol    Meaning

C    Number of classes
D    Dimensionality of data vector (number of features)
R    Number of outputs (response variables)
$\mathcal{D}$    Training data $\mathcal{D} = \{\mathbf{x}_i \mid i = 1:N\}$ or $\mathcal{D} = \{(\mathbf{x}_i, y_i) \mid i = 1:N\}$
$\mathcal{D}_{\text{test}}$    Test data
$J(\boldsymbol{\theta})$    Cost function
K    Number of states or dimensions of a variable (often latent)
$\kappa(\mathbf{x}, \mathbf{y})$    Kernel function
$\mathbf{K}$    Kernel matrix
$\lambda$    Strength of $\ell_2$ or $\ell_1$ regularizer
N    Number of data cases
$N_c$    Number of examples of class c, $N_c = \sum_{i=1}^N \mathbb{I}(y_i = c)$
$\boldsymbol{\phi}(\mathbf{x})$    Basis function expansion of feature vector x
$\boldsymbol{\Phi}$    Basis function expansion of design matrix X
q()    Approximate or proposal distribution
$Q(\boldsymbol{\theta}, \boldsymbol{\theta}_{\text{old}})$    Auxiliary function in EM
S    Number of samples
T    Length of a sequence
$T(\mathcal{D})$    Test statistic for data
$\mathbf{T}$    Transition matrix of Markov chain
$\boldsymbol{\theta}$    Parameter vector
$\boldsymbol{\theta}^{(s)}$    s'th sample of parameter vector
$\hat{\boldsymbol{\theta}}$    Estimate (usually MLE or MAP) of $\boldsymbol{\theta}$
$\hat{\boldsymbol{\theta}}_{ML}$    Maximum likelihood estimate of $\boldsymbol{\theta}$
$\hat{\boldsymbol{\theta}}_{MAP}$    MAP estimate of $\boldsymbol{\theta}$
$\bar{\boldsymbol{\theta}}$    Estimate (usually posterior mean) of $\boldsymbol{\theta}$
$\mathbf{w}$    Vector of regression weights (called $\boldsymbol{\beta}$ in statistics)
$\mathbf{W}$    Matrix of regression weights
$x_{ij}$    Component (i.e., feature) j of data case i, for i = 1:N, j = 1:D
$\mathbf{x}_i$    Training case, i = 1 : N
$\mathbf{X}$    Design matrix of size N × D
$\bar{\mathbf{x}}$    Empirical mean $\bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^N \mathbf{x}_i$
$\tilde{\mathbf{x}}$    Future test case
$\mathbf{x}_*$    Future test case
$\mathbf{y}$    Vector of all training labels $\mathbf{y} = (y_1, \ldots, y_N)$
$z_{ij}$    Latent component j for case i


Graphical model notation 


In graphical models, we index nodes by $s, t, u \in \mathcal{V}$, and states by $i, j, k \in \mathcal{X}$. 

Symbol    Meaning

$s \sim t$    Node s is connected to node t
bel    Belief function
$\mathcal{C}$    Cliques of a graph
$\text{ch}_j$    Child of node j in a DAG
$\text{desc}_j$    Descendants of node j in a DAG
G    A graph
$\mathcal{E}$    Edges of a graph
$\text{mb}_t$    Markov blanket of node t
$\text{nbd}_t$    Neighborhood of node t
$\text{pa}_t$    Parents of node t in a DAG
$\text{pred}_t$    Predecessors of node t in a DAG wrt some ordering
$\psi_c(\mathbf{x}_c)$    Potential function for clique c
$\mathcal{S}$    Separators of a graph
$\theta_{sjk}$    prob. node s is in state k given its parents are in states j
$\mathcal{V}$    Nodes of a graph


List of commonly used abbreviations 


Abbreviation    Meaning

cdf    Cumulative distribution function
CPD    Conditional probability distribution
CPT    Conditional probability table
CRF    Conditional random field
DAG    Directed acyclic graph
DGM    Directed graphical model
EB    Empirical Bayes
EM    Expectation maximization algorithm
EP    Expectation propagation
GLM    Generalized linear model
GMM    Gaussian mixture model
HMM    Hidden Markov model
iid    Independent and identically distributed
iff    If and only if
KL    Kullback Leibler divergence
LDS    Linear dynamical system
LHS    Left hand side (of an equation)
MAP    Maximum a posteriori estimate
MCMC    Markov chain Monte Carlo
MH    Metropolis Hastings
MLE    Maximum likelihood estimate
MPM    Maximum of Posterior Marginals
MRF    Markov random field
MSE    Mean squared error
NLL    Negative log likelihood
OLS    Ordinary least squares
pd    Positive definite (matrix)
pdf    Probability density function
pmf    Probability mass function
RBPF    Rao-Blackwellised particle filter
RHS    Right hand side (of an equation)
RJMCMC    Reversible jump MCMC
RSS    Residual sum of squares
SLDS    Switching linear dynamical system
SSE    Sum of squared errors
UGM    Undirected graphical model
VB    Variational Bayes
wrt    With respect to


degenerate, 532, 535 

degree, 310 


degrees of freedom, 39, 161, 206, 229, 534 


deleted interpolation, 593 

delta rule, 265 

dendrogram, 895 

denoising auto-encoder, 1001 
dense stereo reconstruction, 690 
density estimation, 9 

dependency network, 909 
dependency networks, 679 
derivative free filter, 651 
descendants, 309 

descriptive, 2 

design matrix, 3, 875 

detailed balance, 854 

detailed balance equations, 599 
determinism, 944 

deterministic annealing, 367, 620 
deviance, 547 

DGM, 310 

diagonal, 46 

diagonal covariance LDA, 107 
diagonal LDA, 108 

diameter, 710, 897 

dictionary, 469 

digamma, 361, 752, 958 

digital cameras, 8 

dimensionality reduction, 11, 1000 
Dirac delta function, 39 

Dirac measure, 37, 68 

Dirichlet process, 903 

direct posterior probability approach, 184 
directed, 309 

directed acyclic graph, 310 

directed graphical model, 310 
directed local Markov property, 327 
directed mixed graph, 929 

directed mixed graphical model, 674 
Dirichlet, 79 

Dirichlet Compound Multinomial, 89 
Dirichlet distribution, 47 


Dirichlet multinomial regression LDA, 969 


Dirichlet process, 596, 879, 882, 973, 976 


Dirichlet process mixture models, 508, 755 


discontinuity preserving, 691 
discounted cumulative gain, 303 
discrete, 35 
discrete AdaBoost, 559 
discrete choice modeling, 296 
discrete random variable, 28 
discrete with probability one, 884 
discretize, 59, 691 
discriminability, 106 
discriminant analysis, 101 
discriminant function, 500 
discriminative, 245 
discriminative classifier, 30 
discriminative LDA, 968 
discriminative random field, 684 
disease mapping, 531 
disease transmission, 970 
disparity, 691 
dispersion parameter, 290 
dissimilarity analysis, 898 
dissimilarity matrix, 875 
distance matrix, 875 
distance transform, 775 
distorted, 566 
distortion, 354 
distribute evidence, 707 
distribute-from-root, 724 
distributed encoding, 984 
distributed representation, 569, 627 
distributional particles, 831 
distributive law, 717 
divisive clustering, 893 
DNA sequences, 36 
do calculus, 932 
Document classification, 87 
document classification, 5 
Domain adaptation, 297 
domain adaptation, 297 
dominates, 197 
double loop algorithms, 773 
double Pareto distribution, 461 
double sided exponential, 41 
dRUM, 294 
dual decomposition, 808 
dual variables, 492, 499 
dummy encoding, 35 
dyadic, 976 
DyBN, 628 
DyBNs, 997 
dynamic Bayes net, 653 
dynamic Bayesian network, 628 
dynamic linear model, 636 
dynamic programming, 331, 920 
dynamic topic model, 962 


pees ll 
early stopping, 263, 557, 572 
EB, 173 
ECM, 369, 387 
ECME, 369 
ECOC, 581 
econometric forecasting, 660 
economy sized SVD, 392 
edge appearance probability, 786 
edit distance, 479 
EER, 181 
effective sample size, 75, 825, 862 
efficient IPF, 683 
efficiently PAC-learnable, 210 
eigendecomposition, 98 
eigenfaces, 12 
eigengap, 857 
eigenvalue spectrum, 130 
EKF, 648 
elastic net, 438, 456, 936 
elimination order, 718 
EM, 271, 349, 618, 749 
email spam filtering, 5 
embedding, 575 
empirical Bayes, 157, 162, 173, 300, 746 
empirical distribution, 37, 205 
empirical measure, 37 
empirical risk, 205, 697 
empirical risk minimization, 205, 261 
end effector, 344 
energy based models, 666 
energy function, 255 
energy functional, 732, 778 
ensemble, 980 
Ensemble learning, 580 
ensemble learning, 742 
entanglement, 629 
entanglement problem, 635, 653 
Entropy, 547 
entropy, 56 
EP, 983 
Epanechnikov kernel, 508 
ePCA, 947 
epigraph, 222 
epistemological uncertainty, 973 
epoch, 264, 566 
epsilon insensitive loss function, 497 
EPSR, 859 
equal error rate, 181 
equilibrium distribution, 597 
equivalence class, 915 
equivalent kernel, 512, 533 
equivalent sample size, 76, 917 
erf, 38 
ergodic, 599 
Erlang distribution, 42 
ERM, 205, 261 
error bar, 76 
error correcting codes, 768 
error correction, 56 
error function, 38 
error signal, 265 
error-correcting output codes, 503, 581 
ESS, 862 
essential graph, 915 
estimated potential scale reduction, 859 
estimator, 191 
Euclidean distance, 18 
evidence, 156, 173 
evidence procedure, 173, 238, 746 
evolutionary MCMC, 429 
exchangeable, 321, 963 
exclusive or, 486 
expectation correction, 658 
expectation maximization, 349 
expectation propagation, 735 
expected complete data log likelihood, 350, 351 
expected profit, 330 
expected sufficient statistics, 350, 359, 619 
expected value, 33 
explaining away, 326 
explicit duration HMM, 622 
exploration-exploitation, 184 
exploratory data analysis, 7 
exponential cooling schedule, 870 
Exponential distribution, 42 
exponential family, 115, 253, 281, 282, 290, 347 
exponential family harmonium, 985 
exponential family PCA, 947 
exponential loss, 556 

exponential power distribution, 458 
extended Kalman filter, 648 
extension, 67 

external field, 668 


F score, 183 

F1 score, 183, 699 

FA, 381 

face detection, 8 

face detector, 555 

face recognition, 8 

Facebook, 974 

factor, 665 

factor analysis, 381, 402, 931, 947 
factor analysis distance, 520 
factor graph, 769, 771, 888 
factor loading matrix, 381 
factorial HMM, 628 

factorial prior, 463 

factors, 382 

faithful, 936 

false alarm, 30, 180 

false alarm rate, 181 

false discovery rate, 184 

false negative, 180 

false positive, 30, 180 

false positive rate, 181 

family, 309 

family marginal, 359 

fan-in, 313 

fantasy data, 990 

farthest point clustering, 355 
fast Fourier transform, 717, 775 
fast Gauss transform, 524 

fast ICA, 411 

fast iterative shrinkage thresholding algorithm, 446 
FastSLAM, 635, 835 

fat hand, 933 

fault diagnosis, 659 

feature construction, 564 
feature extraction, 6, 564 
feature function, 667 

feature induction, 680 

feature maps, 565 

feature matrix, 875 

feature selection, 86 
feature-based clustering, 875 
features, 2, 3, 412 

feedback loops, 929 
feedforward neural network, 563 
ferro-magnets, 668 

FFT, 775 

fields of experts, 473 

fill-in edges, 719 

Filtering, 607 

filtering, 87 

finite difference matrix, 113 
finite mixture model, 879 
first-order logic, 674 

Fisher information, 166 

Fisher information matrix, 152, 193, 293 
Fisher kernel, 485 

Fisher scoring method, 293 


Fisher's linear discriminant analysis, 271 


FISTA, 446 

fit-predict cycle, 206 
fixed effect, 298 

Fixed lag smoothing, 608 
fixed point, 139 

flat clustering, 875 

FLDA, 271 

flow cytometry, 936 
folds, 24 

forest, 310, 912 


forward stagewise additive modeling, 557 
forward stagewise linear regression, 562 


forwards KL, 733 
forwards model, 345 
forwards selection, 428 


forwards-backwards, 644, 688, 707, 720 
forwards-backwards algorithm, 428, 611 


founder model, 317 
founder variables, 385 
Fourier basis, 472 

fraction of variance explained, 400 
free energy, 988 

free-form optimization, 737 
frequent itemset mining, 15 
frequentist, 27, 149 
frequentist statistics, 191 
Frobenius norm, 388 
frustrated, 868 

frustrated system, 668 


full conditional, 328, 838 
function approximation, 3 
functional data analysis, 124 
functional gradient descent, 561 
furthest neighbor clustering, 897 
fused lasso, 454 

fuzzy clustering, 973 

fuzzy set theory, 65 


g-prior, 236, 425 

game against nature, 176 

game theory, 176 

Gamma, 623 

gamma distribution, 41 

gamma function, 42 

GaP, 949 

gap statistic, 372 

gating function, 342 
Gauss-Seidel, 710 

Gaussian, 20, 38 

Gaussian approximation, 255, 731 
Gaussian Bayes net, 318 
Gaussian copulas, 942 
Gaussian graphical models, 725 
Gaussian kernel, 480, 507, 517 
Gaussian mixture model, 339 
Gaussian MRF, 672 


Gaussian process, 483, 505, 509, 512, 882 


Gaussian processes, 515 
Gaussian random fields, 938 
Gaussian RBM, 986 


Gaussian scale mixture, 359, 447, 505 


Gaussian sum filter, 656 

GDA, 101 

GEE, 300 

GEM, 369 

Gene finding, 606 

gene finding, 622 

gene knockout experiment, 931 
gene microarrays, 421 
generalization, 3 
generalization error, 23, 180 
generalization gradient, 66 
generalize, 3 
generalized additive model, 552 
generalized belief propagation, 785 
generalized cross validation, 207 
generalized eigenvalue, 274 
generalized EM, 361, 369 
generalized estimating equations, 300 
generalized linear mixed effects model, 298 
generalized linear model, 281, 290 
generalized linear models, 281 
generalized pseudo Bayes filter, 657 
generalized t distribution, 461 
generate and test, 853 
generative approach, 245 
generative classifier, 30 
generative pre-training, 999 
generative weights, 410, 986 
genetic algorithms, 348, 720, 921 
genetic linkage analysis, 315, 318 
genome, 318 
genotype, 317 
geometric distribution, 622 
Gibbs distribution, 290, 666 
Gibbs sampler, 672 
Gibbs sampling, 328, 669, 736, 838 
Gini index, 548 
gist, 963 
Gittins Indices, 184 
Glasso, 940 
Glauber dynamics, 838 
GLM, 290, 654 
GLMM, 298 
glmnet, 442 
global balance equations, 597 
global convergence, 248 
global localization, 828 
global Markov property, 661 
global minimum, 222 
global prior parameter independence, 916 
globally normalized, 686 
GM, 308 
GMM, 339 
Gumbel, 295 


Hadamard product, 609 
Haldane prior, 166 
ham, 5 
Hamiltonian MCMC, 868 
Hammersley-Clifford, 666 
hamming distance, 876 
handwriting recognition, 7 
haplotype, 317 
hard clustering, 340 
hard EM, 352 
hard thresholding, 434, 435 
harmonic mean, 183 
harmonium, 983 
Hastings correction, 849 
hat matrix, 221 
HDI, 154 
heat bath, 838 
heavy ball method, 249 
heavy tails, 43, 223 
Hellinger distance, 735 
Helmholtz free energy, 733 
Hessian, 193, 852 
heteroscedastic LDA, 275 
heuristics, 727 
hidden, 10, 349 
hidden layer, 563 
hidden Markov model, 312, 603, 963 
hidden nodes, 313 
hidden semi-Markov model, 622 
hidden units, 564 
hidden variable, 312, 924 
hidden variables, 319, 914 
hierarchical adaptive lasso, 458 
hierarchical Bayesian model, 171 
hierarchical Bayesian models, 347 
hierarchical clustering, 875, 893 
hierarchical Dirichlet process, 621 
hierarchical HMM, 624 
hierarchical latent class model, 926 
hierarchical mixture of experts, 344, 551 
high throughput, 184, 421 
high variance estimators, 550 
highest density interval, 154 
highest posterior density, 153 